CDR assignment
In the previous document, we described how the clusters were created. Here we describe how CDR clusters are assigned to PDB structures and to sequences.
Assignment to “outlier cluster”
In the CDR clustering document, we mention that an “outlier cluster” was found in 9 out of 20 CDR groups. By this we mean that some CDR structures don’t fit into any of the well-defined clusters in one way or another, and are therefore placed in the outlier cluster. This is not an actual cluster: the structures in it are not structurally similar to one another.
For assignment, we train our model on the outlier cluster as well, but only in CDR groups that have one. In a CDR group with outliers, our model does a decent job of labeling a CDR structure that doesn’t fit any of the real clusters as an outlier. In a CDR group without outliers, however, a CDR structure will always be assigned to a cluster and never labeled as an outlier. Since we haven’t seen any outliers in the training data for those groups, we assume this won’t be a problem.
PDB - CDR assignment
Going from the structure to the cluster is a fairly straightforward task, as the clusters were created based on the structure. We found that a simple random forest trained on the coordinates of the C-alpha atoms (without torsion angles) had almost perfect performance (see the results below). The input includes the C-alpha atoms of the CDR as well as three of the neighboring residues.
CDR group | Cohen's kappa score* | Accuracy |
---|---|---|
L1_5 | 1 | 1 |
L1_6 | 0.971 | 0.993 |
L1_7 | 0.986 | 0.992 |
L1_8 | 0.997 | 0.998 |
L1_9 | 0.963 | 0.974 |
L1_10 | 0.869 | 0.968 |
L1_11 | 0.991 | 0.996 |
L2_3 | 0.96 | 0.995 |
L2_7 | 1 | 1 |
L3_8 | 0.93 | 0.98 |
L3_9 | 0.978 | 0.996 |
L3_10 | 0.957 | 0.97 |
L3_11 | 0.943 | 0.955 |
L3_12 | 0.984 | 0.991 |
H1_8 | 0.935 | 0.963 |
H1_9 | 0.977 | 0.986 |
H1_10 | 0.963 | 0.982 |
H2_7 | 0.929 | 0.973 |
H2_8 | 0.962 | 0.982 |
H2_9 | 0.95 | 0.976 |
H2_10 | 0.955 | 0.979 |
Note: these results are based on 5-fold cross-validation. The final model is trained on the full dataset and is therefore expected to perform somewhat better.
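The structure-based setup described above can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: it assumes scikit-learn and NumPy, uses synthetic coordinates in place of real PDB data, assumes the structures are already superposed in a common reference frame, and interprets the flanking residues as three on each side of the CDR. The helper `ca_features` and all hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def ca_features(ca_coords):
    """Flatten an (n_residues, 3) array of C-alpha coordinates into one row."""
    return np.asarray(ca_coords, dtype=float).reshape(-1)


# Synthetic stand-in data: 3 hypothetical clusters of a length-8 CDR plus
# 3 flanking residues on each side (an assumption), i.e. 14 C-alpha atoms.
rng = np.random.default_rng(0)
n_atoms, n_per_cluster = 14, 40
centres = rng.normal(scale=5.0, size=(3, n_atoms, 3))
X = np.array([
    ca_features(centres[k] + rng.normal(scale=0.5, size=(n_atoms, 3)))
    for k in range(3)
    for _ in range(n_per_cluster)
])
y = np.repeat(np.arange(3), n_per_cluster)

# 5-fold cross-validation: every structure is predicted by a model
# that did not see it during training.
model = RandomForestClassifier(n_estimators=200, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=cv)

kappa = cohen_kappa_score(y, y_pred)
accuracy = accuracy_score(y, y_pred)
print(f"kappa={kappa:.3f} accuracy={accuracy:.3f}")
```

In a CDR group with an outlier cluster, the outliers would simply be one more label in `y`; in a group without one, the classifier has no such label to predict.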
Sequence - CDR assignment
Assigning a CDR cluster from the sequence alone is a more difficult task. A few years back we did this with HMM profiles, but after evaluating the alternatives, we found that other models performed better. We ended up “encoding” each amino acid by its hydrophobicity and mass and classifying with CatBoost, with very satisfying results (see the table below).
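To make the encoding step concrete, here is a minimal sketch of how a CDR sequence could be turned into a numeric feature row. The choice of the Kyte-Doolittle hydropathy scale and of average residue masses is an assumption made for illustration, since the scales are not specified here, and `encode_cdr` is a hypothetical helper name.

```python
# Kyte-Doolittle hydropathy index per amino acid (assumed scale).
HYDROPHOBICITY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Average residue masses in daltons (assumed definition of "mass").
MASS = {
    "A": 71.08, "R": 156.19, "N": 114.10, "D": 115.09, "C": 103.14,
    "Q": 128.13, "E": 129.12, "G": 57.05, "H": 137.14, "I": 113.16,
    "L": 113.16, "K": 128.17, "M": 131.19, "F": 147.18, "P": 97.12,
    "S": 87.08, "T": 101.10, "W": 186.21, "Y": 163.18, "V": 99.13,
}


def encode_cdr(sequence):
    """Encode a CDR sequence as [hydrophobicity, mass] per residue, flattened."""
    features = []
    for aa in sequence.upper():
        features.append(HYDROPHOBICITY[aa])
        features.append(MASS[aa])
    return features


# Example with a made-up 9-residue sequence: 9 residues x 2 properties.
row = encode_cdr("QQYNSYPLT")
print(len(row))
```

Assuming all sequences within a CDR group have the same length, every row in a group has the same number of columns, so the rows can be stacked into a matrix and passed to a classifier such as `catboost.CatBoostClassifier`, one model per CDR group.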
CDR group | Cohen's kappa score* | Accuracy |
---|---|---|
L1_5 | 0.920931 | 0.964789 |
L1_6 | 0.954033 | 0.989643 |
L1_7 | 0.776641 | 0.879679 |
L1_8 | 0.981113 | 0.990164 |
L1_9 | 0.910368 | 0.938525 |
L1_10 | 0.852702 | 0.963283 |
L1_11 | 0.928727 | 0.966951 |
L2_3 | 0.850165 | 0.983323 |
L2_7 | 0.961385 | 0.982759 |
L3_8 | 0.766381 | 0.938776 |
L3_9 | 0.940029 | 0.988933 |
L3_10 | 0.879231 | 0.914414 |
L3_11 | 0.843214 | 0.878205 |
L3_12 | 0.941965 | 0.968481 |
H1_8 | 0.813099 | 0.895833 |
H1_9 | 0.844344 | 0.909836 |
H1_10 | 0.968396 | 0.98481 |
H2_7 | 0.846967 | 0.941654 |
H2_8 | 0.913255 | 0.959522 |
H2_9 | 0.975369 | 0.988235 |
H2_10 | 0.96733 | 0.984334 |
Note: these results are based on 5-fold cross-validation. The final model is trained on the full dataset and is therefore expected to perform somewhat better.
*Cohen's kappa score is a useful metric for evaluating the performance of a multiclass classifier. It corrects for the agreement expected by chance and therefore accounts for class imbalance, which is why in some cases it differs quite a bit from accuracy.
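To illustrate why kappa and accuracy can diverge, the sketch below computes Cohen's kappa as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance given the label frequencies. The labels are made up for illustration, and `cohen_kappa` is a hypothetical helper, not the function used to produce the tables.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between true and predicted labels."""
    n = len(y_true)
    # Observed agreement: fraction of matching labels (the accuracy).
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: from the marginal label counts of both sides.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Imbalanced toy group: 95 structures in cluster "A", 5 in cluster "B".
y_true = ["A"] * 95 + ["B"] * 5
always_a = ["A"] * 100  # a classifier that ignores its input

accuracy = sum(t == p for t, p in zip(y_true, always_a)) / len(y_true)
print(accuracy)                       # 0.95
print(cohen_kappa(y_true, always_a))  # 0.0
```

A classifier that always predicts the majority cluster reaches 95% accuracy here but a kappa of 0, because it does no better than chance given the class frequencies.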