In the previous document, we described how the clusters were created. Here we describe how CDR clusters are assigned to PDB structures and to sequences.

Assignment to “outlier cluster”

The CDR clustering document mentions that an “outlier cluster” was found in 9 out of 20 CDR groups. By this we mean a set of CDR structures that, for one reason or another, do not fit into any of the well-defined clusters and are therefore collected in the outlier cluster. It is not an actual cluster: the structures in it are not structurally similar to one another.

When it comes to assignment, we train our model on the outlier cluster as well, but only for CDR groups that have one. In a CDR group with outliers, the model does a decent job of labelling a CDR structure that does not fit any of the real clusters as an outlier. In a CDR group without outliers, however, a CDR structure is always assigned to a cluster and never labelled as an outlier. Since no outliers were seen in the training data for these groups, we assume this will not be a problem.

PDB - CDR assignment

Going from structure to cluster is a fairly straightforward task, since the clusters were created from the structures. We found that a simple random forest trained on the coordinates of the C-alpha atoms (without torsion angles) performed almost perfectly (see the table below). The input includes the C-alpha atoms of the CDR as well as those of three of the neighboring residues.
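As a rough illustration of the input described above, the sketch below flattens a list of C-alpha coordinates into a fixed-length feature vector. It is only a sketch under assumptions: the function name `ca_feature_vector`, the `expected_len` check and the example coordinates are hypothetical, and it assumes that every loop in a CDR group contributes the same number of C-alpha atoms and that the coordinates are already expressed in a comparable frame.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def ca_feature_vector(ca_coords: Sequence[Coord], expected_len: int) -> List[float]:
    """Flatten C-alpha coordinates into [x1, y1, z1, x2, y2, z2, ...].

    expected_len is the number of C-alpha atoms per loop in this CDR group
    (CDR residues plus neighboring residues); it is a hypothetical parameter.
    """
    if len(ca_coords) != expected_len:
        raise ValueError(
            f"expected {expected_len} C-alpha atoms, got {len(ca_coords)}"
        )
    # One row per structure: three numbers (x, y, z) per C-alpha atom.
    return [value for atom in ca_coords for value in atom]


# Made-up coordinates for 8 C-alpha atoms (not taken from a real structure).
coords = [(float(i), 0.5 * i, -1.0 * i) for i in range(8)]
features = ca_feature_vector(coords, expected_len=8)
print(len(features))  # 8 atoms x 3 coordinates
```

Vectors like this, one per structure, could then be passed to any random forest implementation together with the cluster labels, with one model trained per CDR group.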

CDR group	cohen kappa score*	accuracy
L1_5	1	1
L1_6	0.971	0.993
L1_7	0.986	0.992
L1_8	0.997	0.998
L1_9	0.963	0.974
L1_10	0.869	0.968
L1_11	0.991	0.996
L2_3	0.96	0.995
L2_7	1	1
L3_8	0.93	0.98
L3_9	0.978	0.996
L3_10	0.957	0.97
L3_11	0.943	0.955
L3_12	0.984	0.991
H1_8	0.935	0.963
H1_9	0.977	0.986
H1_10	0.963	0.982
H2_7	0.929	0.973
H2_8	0.962	0.982
H2_9	0.95	0.976
H2_10	0.955	0.979

Note: these results are based on 5-fold cross-validation. The final model is trained on the full dataset and is therefore expected to perform somewhat better.

Sequence - CDR assignment

Assigning a CDR cluster based on the sequence alone is a more difficult task. A few years ago we did this with HMM profiles, but after evaluating the alternatives we found other models to perform better. We ended up “encoding” each amino acid as its hydrophobicity and mass and classifying with CatBoost, with very satisfying results (see the table below).
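The encoding step can be sketched as follows. This is a minimal illustration and not necessarily the exact encoding we use: it assumes the Kyte-Doolittle hydropathy scale and average residue masses in daltons, and the function name `encode_cdr` and the example sequence are hypothetical.

```python
from typing import Dict, List

# Assumption: Kyte-Doolittle hydropathy values (the scale is our choice here).
HYDROPHOBICITY: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumption: average residue masses in daltons.
RESIDUE_MASS: Dict[str, float] = {
    "A": 71.08, "R": 156.19, "N": 114.10, "D": 115.09, "C": 103.14,
    "Q": 128.13, "E": 129.12, "G": 57.05, "H": 137.14, "I": 113.16,
    "L": 113.16, "K": 128.17, "M": 131.19, "F": 147.18, "P": 97.12,
    "S": 87.08, "T": 101.10, "W": 186.21, "Y": 163.18, "V": 99.13,
}


def encode_cdr(sequence: str) -> List[float]:
    """Encode each residue as (hydrophobicity, mass) and concatenate the pairs."""
    features: List[float] = []
    for aa in sequence.upper():
        if aa not in HYDROPHOBICITY:
            raise ValueError(f"unknown amino acid: {aa!r}")
        features.extend((HYDROPHOBICITY[aa], RESIDUE_MASS[aa]))
    return features


# Hypothetical 8-residue CDR sequence, giving 16 numeric features.
example = encode_cdr("GFTFSSYA")
print(len(example))
```

Because all sequences within a CDR group are assumed here to have the same length, every sequence yields a vector of the same size, which can be fed to a gradient-boosting classifier such as CatBoost.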

CDR group	cohen kappa score*	accuracy
L1_5	0.920931	0.964789
L1_6	0.954033	0.989643
L1_7	0.776641	0.879679
L1_8	0.981113	0.990164
L1_9	0.910368	0.938525
L1_10	0.852702	0.963283
L1_11	0.928727	0.966951
L2_3	0.850165	0.983323
L2_7	0.961385	0.982759
L3_8	0.766381	0.938776
L3_9	0.940029	0.988933
L3_10	0.879231	0.914414
L3_11	0.843214	0.878205
L3_12	0.941965	0.968481
H1_8	0.813099	0.895833
H1_9	0.844344	0.909836
H1_10	0.968396	0.98481
H2_7	0.846967	0.941654
H2_8	0.913255	0.959522
H2_9	0.975369	0.988235
H2_10	0.96733	0.984334

Note: these results are based on 5-fold cross-validation. The final model is trained on the full dataset and is therefore expected to perform somewhat better.

*Cohen's kappa score is a useful metric for evaluating a multiclass classifier. It corrects for the agreement expected by chance and therefore takes class imbalance into account, which is why in some cases it differs considerably from accuracy.
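To make the difference between the two metrics concrete, here is a small self-contained sketch of Cohen's kappa, computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance. The labels in the example are made up and the function names are hypothetical; in practice a library implementation would normally be used.

```python
from collections import Counter
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: per-class product of the two label frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when only one class occurs
    return (p_o - p_e) / (1.0 - p_e)


# Made-up imbalanced example: 18 structures in cluster "A", 2 in cluster "B",
# and a classifier that always predicts "A".
y_true = ["A"] * 18 + ["B"] * 2
y_pred = ["A"] * 20

acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
print(acc, kappa)
```

In this example the accuracy is 0.9 while kappa is 0, because always predicting the majority cluster is no better than chance agreement.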