CDR assignment

In the previous document, we described how we created the clusters. Here we describe how CDR clusters are assigned to PDB structures and to sequences.

Assignment to “outlier cluster”

The CDR clustering document mentions that an “outlier cluster” was found in 9 out of 20 CDR groups. By this we mean that some CDR structures do not fit any of the well-defined clusters in one way or another and therefore end up in the outlier cluster. This is not an actual cluster: the structures in it are not structurally similar to each other.

For assignment, we train our model on the outlier cluster as well, but only in CDR groups that have one. In a CDR group with outliers, the model does a decent job of labeling a CDR structure that doesn’t fit any of the real clusters as an outlier. In a CDR group without outliers, however, a CDR structure is always assigned to a cluster and never labeled as an outlier. Since the training data for those groups contains no outliers, we assume this won’t be a problem.

PDB - CDR assignment

Going from the structure to the cluster is a fairly straightforward task, since the clusters are derived from the structures. We found that a simple random forest trained on the coordinates of the C-alpha atoms (without torsion angles) performs almost perfectly (see the results below). The input includes the C-alpha atoms of the CDR as well as three of the neighboring residues.

| CDR group | Cohen's kappa* | Accuracy |
|-----------|----------------|----------|
| L1_5      | 1              | 1        |
| L1_6      | 0.971          | 0.993    |
| L1_7      | 0.986          | 0.992    |
| L1_8      | 0.997          | 0.998    |
| L1_9      | 0.963          | 0.974    |
| L1_10     | 0.869          | 0.968    |
| L1_11     | 0.991          | 0.996    |
| L2_3      | 0.96           | 0.995    |
| L2_7      | 1              | 1        |
| L3_9      | 0.978          | 0.996    |
| L3_10     | 0.957          | 0.97     |
| L3_11     | 0.943          | 0.955    |
| L3_8      | 0.93           | 0.98     |
| L3_12     | 0.984          | 0.991    |
| H1_8      | 0.935          | 0.963    |
| H1_9      | 0.977          | 0.986    |
| H1_10     | 0.963          | 0.982    |
| H2_7      | 0.929          | 0.973    |
| H2_8      | 0.962          | 0.982    |
| H2_9      | 0.95           | 0.976    |
| H2_10     | 0.955          | 0.979    |

Note: these results are based on 5-fold cross-validation. The final model is trained on the full dataset and is therefore expected to perform somewhat better.
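
As an illustration of the input described above, the following minimal sketch flattens the C-alpha coordinates of a CDR and its neighboring residues into one feature vector. It is not our actual pipeline: the function name `ca_feature_vector`, the choice of three flanking residues on each side, and the assumption that all structures in a CDR group share the same loop length and coordinate frame are hypothetical and only serve the example.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def ca_feature_vector(
    ca_coords: Sequence[Coord],
    cdr_start: int,
    cdr_end: int,
    flank: int = 3,
) -> List[float]:
    """Flatten the C-alpha coordinates of a CDR plus flanking residues.

    ca_coords : one (x, y, z) tuple per residue of the chain, in chain order
    cdr_start : index of the first CDR residue (0-based, inclusive)
    cdr_end   : index of the last CDR residue (0-based, inclusive)
    flank     : neighboring residues taken on each side (assumption)
    """
    first = cdr_start - flank
    last = cdr_end + flank
    if first < 0 or last >= len(ca_coords):
        raise ValueError("not enough flanking residues around the CDR")
    features: List[float] = []
    for x, y, z in ca_coords[first:last + 1]:
        features.extend((x, y, z))
    return features


# Toy chain of 12 residues spaced 3.8 angstrom apart along the x axis.
chain = [(3.8 * i, 0.0, 0.0) for i in range(12)]
vector = ca_feature_vector(chain, cdr_start=3, cdr_end=7, flank=3)
print(len(vector))  # (5 CDR + 2 * 3 flanking residues) * 3 coordinates = 33
```

Vectors of this kind, one per structure, could then be used as rows of the training matrix for a random forest classifier, with the cluster label as the target.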

Sequence - CDR assignment

Assigning a CDR cluster based on the sequence alone is a more difficult task. A few years back we did this with HMM profiles, but after evaluating the alternatives we found that other models perform better. We ended up “encoding” each amino acid as its hydrophobicity and mass and classifying with CatBoost, with very satisfying results (see the table below).

| CDR group | Cohen's kappa* | Accuracy |
|-----------|----------------|----------|
| L1_5      | 0.920931       | 0.964789 |
| L1_6      | 0.954033       | 0.989643 |
| L1_7      | 0.776641       | 0.879679 |
| L1_8      | 0.981113       | 0.990164 |
| L1_9      | 0.910368       | 0.938525 |
| L1_10     | 0.852702       | 0.963283 |
| L1_11     | 0.928727       | 0.966951 |
| L2_3      | 0.850165       | 0.983323 |
| L2_7      | 0.961385       | 0.982759 |
| L3_8      | 0.766381       | 0.938776 |
| L3_9      | 0.940029       | 0.988933 |
| L3_10     | 0.879231       | 0.914414 |
| L3_11     | 0.843214       | 0.878205 |
| L3_12     | 0.941965       | 0.968481 |
| H1_8      | 0.813099       | 0.895833 |
| H1_9      | 0.844344       | 0.909836 |
| H1_10     | 0.968396       | 0.98481  |
| H2_7      | 0.846967       | 0.941654 |
| H2_8      | 0.913255       | 0.959522 |
| H2_9      | 0.975369       | 0.988235 |
| H2_10     | 0.96733        | 0.984334 |

Note: these results are based on 5-fold cross-validation. The final model is trained on the full dataset and is therefore expected to perform somewhat better.
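
The encoding step can be sketched as follows. This is an illustration only, not our production code: the function name `encode_cdr` is hypothetical, and the text above does not specify which hydrophobicity scale or mass definition is used, so the Kyte-Doolittle hydropathy index and average residue masses (in daltons) are assumptions made for the example.

```python
from typing import Dict, List

# Kyte-Doolittle hydropathy index per amino acid (assumed scale).
HYDROPHOBICITY: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Average residue mass in daltons (assumed mass definition).
RESIDUE_MASS: Dict[str, float] = {
    "A": 71.08, "R": 156.19, "N": 114.10, "D": 115.09, "C": 103.14,
    "Q": 128.13, "E": 129.12, "G": 57.05, "H": 137.14, "I": 113.16,
    "L": 113.16, "K": 128.17, "M": 131.19, "F": 147.18, "P": 97.12,
    "S": 87.08, "T": 101.11, "W": 186.21, "Y": 163.18, "V": 99.13,
}


def encode_cdr(sequence: str) -> List[float]:
    """Encode each residue as (hydrophobicity, mass), flattened in order."""
    features: List[float] = []
    for aa in sequence.upper():
        if aa not in HYDROPHOBICITY:
            raise ValueError(f"unknown amino acid: {aa!r}")
        features.append(HYDROPHOBICITY[aa])
        features.append(RESIDUE_MASS[aa])
    return features


# Example CDR sequence of length 9 (made up for illustration).
encoded = encode_cdr("QQYNSYPLT")
print(len(encoded))   # 9 residues * 2 features = 18
print(encoded[:4])    # [-3.5, 128.13, -3.5, 128.13]
```

Because every sequence in a CDR group is assumed here to have the same length, each group yields fixed-width numeric vectors that a gradient-boosted classifier such as CatBoost can consume directly.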

*Cohen's kappa is a useful metric for evaluating a multiclass classifier. It corrects for the agreement expected by chance and therefore takes class imbalance into account, which is why in some cases it differs quite a bit from the accuracy.
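
To make the difference between the two metrics concrete, here is a small worked example with made-up labels (not taken from our data). Cohen's kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance from the label frequencies. The helper names `accuracy` and `cohen_kappa` are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of samples where prediction and truth agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal label frequencies, summed.
    p_expected = sum(
        true_counts[label] * pred_counts[label] for label in true_counts
    ) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one label occurs")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Imbalanced toy data: 90 samples of cluster A, 10 of cluster B.
# The classifier finds all of A but only half of B.
y_true = ["A"] * 90 + ["B"] * 10
y_pred = ["A"] * 95 + ["B"] * 5

acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
print(f"accuracy = {acc:.3f}, kappa = {kappa:.3f}")
# accuracy = 0.950, kappa = 0.643
```

Here p_e = 0.9 * 0.95 + 0.1 * 0.05 = 0.86, so the accuracy of 0.95 looks strong while kappa shows that the minority cluster is recognised only half of the time.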