CDR clustering
Introduction and objective
The complementarity-determining regions (CDRs) of an antibody determine its binding selectivity. Every antibody has a light and a heavy chain, and every chain has three CDRs, very creatively labelled CDR 1, 2 and 3. Besides the division by chain type and CDR number, CDRs also vary in length. In this document, we dive deeper into the method we developed to find clusters within each chain type, CDR number and CDR length combination (from now on: CDR group). Our clusters can help researchers gain more insight into how CDRs work and how they relate to each other, which can assist in drug development and antibody research.
The data
Since this is an unsupervised machine learning problem, the input data has a large impact on the final results. We therefore took the SAbDab antibody database as input for our clustering pipeline, so only high-quality structures are included. Subsequently, this dataset was filtered to structures with a resolution smaller than 5 Å, to further ensure the quality of the data. Lastly, when available, we used the PDB-REDO structure.
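As a sketch, the resolution filter could look like the snippet below. The column names and the inline example data are assumptions for illustration, not the actual SAbDab summary format.

```python
import pandas as pd

# Hypothetical stand-in for a SAbDab summary table; column names are assumptions.
df = pd.DataFrame({
    "pdb": ["1abc", "2def", "3ghi"],
    "resolution": ["2.1", "6.0", "NOT"],  # resolutions often arrive as strings
})

# Coerce to numeric: non-numeric entries (e.g. NMR structures) become NaN
df["resolution"] = pd.to_numeric(df["resolution"], errors="coerce")

# Keep only structures with a resolution below 5 Å (NaN rows are dropped too)
filtered = df[df["resolution"] < 5.0]
print(list(filtered["pdb"]))  # → ['1abc']
```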
Method
Extracting data
First, we used ANARCI to obtain the Chothia numbering of the antibodies, and then used the IMGT definition of the CDRs to extract the CDR structures from the antibody PDB files. Besides the CDRs themselves, we also extracted seven residues on both sides of each CDR (from now on referred to as anchors).
All CDRs were grouped by chain type (light or heavy), CDR number (1, 2 or 3) and CDR length (between 3 and 26), resulting in groups like H1_8 (heavy chain, CDR number 1, length 8). All CDRs within a group are superposed on each other, where the superposition is based on the anchors only. We do this because the anchors are sequentially and structurally more conserved, which makes the structural differences between the CDRs themselves better detectable and visible.
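Anchor-based superposition of this kind can be sketched with the Kabsch algorithm. The function below is an illustration (not our exact implementation), assuming NumPy arrays of Cα coordinates: the optimal rotation is computed from the anchor atoms only and then applied to the full structure.

```python
import numpy as np

def kabsch_superpose(mobile_anchors, ref_anchors, mobile_all):
    """Superpose `mobile_all` onto the reference frame, using the optimal
    rotation/translation computed from the anchor coordinates only (Kabsch)."""
    # Center both anchor sets on their centroids
    mc = mobile_anchors.mean(axis=0)
    rc = ref_anchors.mean(axis=0)
    P = mobile_anchors - mc
    Q = ref_anchors - rc
    # Optimal rotation from the SVD of the anchor covariance matrix
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))  # guard against improper reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    # Apply the anchor-derived transform to all atoms (anchors + CDR)
    return (mobile_all - mc) @ R + rc
```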
Note: The CDR group labelling is similar to that of Dunbrack. However, since we use different CDR definitions, our H1_8 doesn’t match their H1-8.
Features
From the obtained CDR structures, we extract for each residue the torsion angles (φ and ψ) and the coordinates (x, y and z) of the alpha carbon. Unfortunately, machine learning algorithms cannot deal with angles very well: they compute the distance between a 5-degree and a 355-degree angle as 350 degrees, rather than 10 degrees. To make sure distances are computed properly, we take both the sine and the cosine of each torsion angle. Eventually, we end up with seven features per residue: the coordinates x, y and z, and the angle features sin(φ), cos(φ), sin(ψ) and cos(ψ).
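The sine/cosine trick from the paragraph above can be illustrated in a few lines of NumPy: each angle becomes a point on the unit circle, so Euclidean distance respects the wrap-around at 360°.

```python
import numpy as np

# Torsion angles in degrees: 5° and 355° are only 10° apart on the circle,
# but differ by 350 in raw value.
phi = np.array([5.0, 355.0])

# Encode each angle as (sin, cos) so distances respect the wrap-around
rad = np.deg2rad(phi)
features = np.column_stack([np.sin(rad), np.cos(rad)])

# The encoded distance is the chord length 2*sin(10°/2) ≈ 0.174,
# versus |5 − 355| = 350 on the raw values.
dist = np.linalg.norm(features[0] - features[1])
```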
Info: The coordinate features and the angle features live on completely different scales. The clustering pipeline will either select exclusively the coordinates or exclusively the angles - or the entire feature set will be scaled (e.g. by a MinMaxScaler).
Clustering pipeline
To get the best clustering results, we chose to build a pipeline with several data processing steps, followed by the clustering algorithm. We then perform a grid search over this pipeline with many different parameters, resulting in a large number of different clustering results per CDR group. The best clustering result per CDR group is then selected based on the score and manual inspection.
The steps in the pipeline:
Feature selection: use exclusively the coordinates, exclusively the angles, or both
Scaling: no scaling, MinMaxScaler or StandardScaler
Dimensionality reduction: no reduction, PCA or UMAP
HDBSCAN
Within this pipeline, we also varied the parameters used in the grid search, such as the number of components for the dimensionality reduction techniques and the minimum cluster size for HDBSCAN.
Note: The best clustering for each CDR group may have come from a different pipeline configuration. The H1_8 clusters could, for example, come from a pipeline that only takes the coordinates, doesn’t scale the data and doesn’t apply any dimensionality reduction, while the H1_9 pipeline takes both the coordinates and the angles, applies the MinMaxScaler and uses PCA to reduce dimensionality.
Scoring
We were not able to find or create a scoring method that reliably picked the best clustering result on its own. The Calinski-Harabasz score was the most useful metric to help narrow down the candidates, but the best clustering was always determined manually.
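For reference, the Calinski-Harabasz score is the ratio of between-cluster to within-cluster dispersion, and is available directly in scikit-learn. A tiny worked example (with made-up points, not our data):

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

# Two tight, well-separated groups of two points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

# Between-cluster dispersion B = 100, within-cluster dispersion W = 1,
# so the score is (B / (k-1)) / (W / (n-k)) = (100 / 1) / (1 / 2) = 200
score = calinski_harabasz_score(X, labels)
print(score)  # → 200.0
```

Higher is better, which is why well-separated, compact clusters rise to the top of the grid search.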
Manual inspection
Manual selection involved visually comparing the clusters by looking at several CDR structures from each cluster. We checked how they differed, whether they were different enough from each other, and whether there wasn’t too much variety within a cluster. Additionally, we cross-referenced our findings with the sequence logos of the Ramachandran classes to confirm our selections. See the figure below.
The clustering algorithm (HDBSCAN) isn’t always able to find clusters, which is a good indication there aren’t any. However, our grid search tries so many different data processing steps that, in the end, it will always find some kind of clustering. The visual inspection step is essential to filter out the CDR groups where there isn’t much difference between the clusters. Also, in CDR groups that do contain real clusters, we sometimes merged two clusters that were too similar.
Results
For each of the CDR groups, we found the following number of clusters. Per CDR group, the clustering pipeline produced many different candidate clusterings, from which we handpicked the best one.
| CDR group | Number of clusters |
| --- | --- |
| H1_8 | 4 |
| H1_9 | 3 |
| H1_10 | 3 |
| H2_7 | 2 |
| H2_8 | 2 |
| H2_9 | 2 |
| H2_10 | 3 |
| L1_5 | 2 |
| L1_6 | 2 |
| L1_7 | 4 |
| L1_8 | 4 |
| L1_9 | 5 |
| L1_10 | 2 |
| L1_11 | 2 |
| L2_3 | 3 |
| L2_7 | 2 |
| L3_8 | 3 |
| L3_9 | 2 |
| L3_10 | 6 |
| L3_11 | 8 |
| L3_12 | 3 |
Note: for 9 out of the 21 CDR groups, we also observed an “outlier cluster”: a group of structures that didn’t belong to any cluster. This outlier cluster is not a real cluster and is therefore not counted in the number of clusters.
Annotation
The clusters are always labelled in order of size. So if a CDR group has 3 clusters, the biggest cluster will be cluster 1 and the smallest will be cluster 3. If there is an “outlier cluster” (which isn’t a real cluster), it is always annotated as cluster 0.
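This size-ordered annotation is easy to express in code. The sketch below assumes HDBSCAN-style labels where -1 marks outliers; raw cluster ids are relabelled so that the largest cluster becomes 1 and outliers become 0.

```python
from collections import Counter

def annotate_clusters(labels):
    """Relabel raw cluster ids by descending size: the largest cluster
    becomes 1, the next 2, and so on; outliers (label -1) become cluster 0."""
    sizes = Counter(l for l in labels if l != -1)
    # Rank raw labels by cluster size, largest first
    rank = {raw: i + 1 for i, (raw, _) in enumerate(sizes.most_common())}
    return [0 if l == -1 else rank[l] for l in labels]

print(annotate_clusters([7, 7, 7, 2, 2, -1, 5]))  # → [1, 1, 1, 2, 2, 0, 3]
```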