CDR conformational clustering and sequence-based assignment

Introduction and objective

The complementarity-determining regions (CDRs) determine the selectivity of an antibody. Every antibody has a light and a heavy chain, and every chain has three CDRs of varying lengths. In this document, we describe a newly developed method to find structurally conserved clusters within each combination of chain type, CDR number, and CDR length (from now on called a CDR group). In addition, we trained a machine learning classifier to assign new sequences to one of the CDR clusters. These CDR clusters help researchers gain insight into how CDRs work and how they relate to each other. Assigning antibody sequences to these CDR clusters can assist researchers in drug development and further antibody research.

The data

Since labelled data is scarce, this is an unsupervised machine learning problem, and the input data therefore has a large impact on the final results. We took the SAbDab antibody database as input for our clustering pipeline, to ensure that only high-quality structures are included. This dataset was then filtered to structures with a resolution value below 5 Å, as an additional quality safeguard. Lastly, when available, we used the PDB_REDO structure data, which provides optimised versions of the structures, in the clustering pipeline.

Methods

Extracting data

ANARCI was used to obtain the Chothia numbering of the antibodies. Next, the IMGT definition of the CDRs was used to extract the CDR structures from the antibody PDB files. In addition to the CDR residues, we extracted seven residues on both sides of each CDR, referred to as the anchors. These anchors are more conserved in both sequence and structure, which makes them useful as (literal) anchor points for extracting structural features for clustering.

All CDRs were grouped by chain type (light or heavy), CDR number (1, 2, or 3) and CDR length (between 3 and 26), resulting in groups such as H1_8 (heavy chain, CDR number 1, length 8). All CDRs within each group were iteratively superposed. The superposition is based only on the anchors, because they are conserved. In addition, superposing on the anchors makes the structural differences between the CDR loops themselves clearer and easier to detect.
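The grouping and the anchor-based superposition can be illustrated with a minimal sketch. It assumes each extracted fragment is an array of C-alpha coordinates laid out as [anchor | CDR | anchor], and it shows a single pairwise Kabsch fit rather than the full iterative procedure. The function names and the array layout are hypothetical and not taken from the original pipeline.

```python
import numpy as np

N_ANCHOR = 7  # anchor residues on each side of the CDR, as described above


def group_label(chain: str, cdr_number: int, cdr_length: int) -> str:
    """Build a CDR group label such as 'H1_8'."""
    return f"{chain}{cdr_number}_{cdr_length}"


def superpose_on_anchors(mobile: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Fit `mobile` onto `reference` using only the anchor atoms (Kabsch).

    Both arrays have shape (n_residues, 3) and the same length, which holds
    within one CDR group. The whole fragment is transformed, not just the anchors.
    """
    n = len(mobile)
    idx = np.r_[0:N_ANCHOR, n - N_ANCHOR:n]  # indices of both anchors
    mob_anchor, ref_anchor = mobile[idx], reference[idx]
    mob_center, ref_center = mob_anchor.mean(axis=0), ref_anchor.mean(axis=0)

    # Covariance of the centred anchors, then SVD to get the optimal rotation.
    cov = (mob_anchor - mob_center).T @ (ref_anchor - ref_center)
    u, _, vt = np.linalg.svd(cov)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    return (mobile - mob_center) @ rot + ref_center
```

Because the fit ignores the CDR residues, any remaining deviation after superposition sits in the loop itself, which is the signal the clustering is meant to pick up.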

Note: The CDR group labelling is similar to Dunbrack’s approach. However, because the CDR definitions differ, our H1_8 is not the same as Dunbrack’s H1-8.

Collecting features

For each residue in the obtained CDRs, the torsion angles (φ and ψ) and the coordinates (x, y and z) of the alpha carbon were extracted. Angles are periodic, but a distance computed on raw angle values treats a 5-degree and a 355-degree angle as 350 degrees apart, rather than 10 degrees. Therefore, both the sine and cosine of each torsion angle are used, so that the correct relative angular distance is computed. For every residue we end up with seven features: the coordinates x, y, z, and the torsion angle terms sin(φ), cos(φ), sin(ψ), and cos(ψ).

Info: The coordinate features and the angle features are on different scales. The clustering pipeline selects the coordinates, the angles, or both. Because of the difference in scale, a scaler (e.g. a MinMaxScaler) is added as an optional step.
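The effect of the sine and cosine encoding is easy to check numerically. The sketch below assumes angles are given in degrees and that the coordinates and angles arrive as NumPy arrays. The function names are illustrative only.

```python
import numpy as np


def encode_torsions(phi_deg, psi_deg):
    """Return [sin(phi), cos(phi), sin(psi), cos(psi)] for angles in degrees."""
    phi, psi = np.radians(phi_deg), np.radians(psi_deg)
    return np.stack([np.sin(phi), np.cos(phi), np.sin(psi), np.cos(psi)], axis=-1)


def residue_features(ca_xyz, phi_deg, psi_deg):
    """Concatenate C-alpha coordinates with encoded torsions: 7 features per residue."""
    xyz = np.asarray(ca_xyz, dtype=float)
    return np.concatenate([xyz, encode_torsions(phi_deg, psi_deg)], axis=-1)


# 5 and 355 degrees are 10 degrees apart on the circle, 350 apart on the number line.
a = encode_torsions(np.array([5.0]), np.array([0.0]))
b = encode_torsions(np.array([355.0]), np.array([0.0]))
c = encode_torsions(np.array([180.0]), np.array([0.0]))
print(np.linalg.norm(a - b))  # small: the chord length 2*sin(5 degrees)
print(np.linalg.norm(a - c))  # large: the two angles are almost opposite
```

In the encoded space the distance between two angles is the chord between the corresponding points on the unit circle, so it grows monotonically with the true angular difference up to 180 degrees.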

Clustering pipeline

We created a pipeline with several data processing steps, followed by the clustering algorithm. We then performed a grid search over this pipeline, varying the features collected above as well as an optional scaler and an optional dimensionality reduction. This results in a large number of different clusterings per CDR group. The best clustering result per CDR group is selected based on scoring and manual inspection.

The steps in the pipeline are as follows:

  • Feature selection: either the coordinates, the angles, or both.

  • Scale: either MinMaxScaler, StandardScaler, or none.

  • Dimensionality reduction: either PCA, UMAP, or none.

  • HDBSCAN

In the grid search, we varied the options listed above together with the hyperparameters of the individual steps, for example the number of components in the dimensionality reduction techniques or the minimum cluster size in HDBSCAN.

Note: The best clustering for each CDR group might result from a pipeline with different parameters. For example, the H1_8 clusters could have been created by a pipeline that only takes the coordinates, no scaling on the data and no dimensionality reduction, while the H1_9 pipeline takes both the coordinates and angles, applies the MinMaxScaler and uses PCA to reduce dimensionality.

Scoring

The Calinski-Harabasz score performed best at ranking the clusterings. Nevertheless, it was still insufficient on its own to select the best clusters per CDR group. Manual inspection of the clusters gave the best results and was therefore the main selection criterion. Details on the manual inspection are explained below.
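For reference, the Calinski-Harabasz score is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom. The sketch below is a plain NumPy version of that standard definition, included to show what the score rewards. How noise points were treated in the original scoring is not stated, so none are handled here.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """Between-cluster over within-cluster dispersion, each per degree of freedom."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    clusters = np.unique(labels)
    n, k = len(X), len(clusters)
    overall_mean = X.mean(axis=0)
    between = within = 0.0
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Spread of the centroids around the overall mean, weighted by cluster size.
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # Spread of the members around their own centroid.
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))


points = np.array([[0.0], [1.0], [10.0], [11.0]])
print(calinski_harabasz(points, [0, 0, 1, 1]))  # compact, well-separated clusters
print(calinski_harabasz(points, [0, 1, 0, 1]))  # interleaved clusters score far lower
```

The score favours compact, well-separated clusters. That is not always the same as structurally meaningful clusters, which is consistent with the need for manual inspection.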

Manual inspection

The manual inspection involved visually comparing several different CDRs from each cluster. We checked how the clusters differed, whether they were different enough from each other, and whether there was too much variety within a cluster. Additionally, we cross-referenced our findings with the sequence logos of Ramachandran classes to confirm our selections. See the figure below.

Figure: A) Example of a cartoon display of the backbone of two different clusters. B) Visual explanation of the different “Ramachandran classes”, which are used in the sequence logo. C) An example of sequence logos of the Ramachandran classes. In this case, positions 8 and 10 are the distinguishing features between the two clusters.

HDBSCAN occasionally fails to identify any clusters, which often indicates that the data has no inherent cluster structure. However, because our extensive grid search covers a diverse array of data processing steps, HDBSCAN tends to produce some clustering result for every group. The manual inspection step is therefore essential to filter out the CDR groups where there is little difference between the clusters. In addition, two clusters are merged when they are similar.

Results

We found the following number of clusters for each of the different CDR groups. For each CDR group, the clustering pipeline produced multiple alternative clusterings. As elaborated above, the best set of features and cluster parameters for a specific CDR group was determined by manual inspection of the clustering result.

 

The clusters are always labelled in order of size. Hence, if a CDR group has 3 clusters, the biggest cluster will be cluster 1 and the smallest will be cluster 3. An outlier cluster, as explained in the section on the outlier cluster below, is always annotated as cluster 0.
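This labelling rule can be written as a small relabelling step. The sketch assumes the raw labels mark outliers with -1, which is the HDBSCAN convention for noise points. Whether the outlier cluster in this work corresponds exactly to HDBSCAN noise is an assumption, and the function name is hypothetical.

```python
from collections import Counter

import numpy as np


def relabel_by_size(labels, outlier_label=-1):
    """Renumber clusters so that 1 is the largest; outliers become 0."""
    labels = np.asarray(labels)
    counts = Counter(labels[labels != outlier_label].tolist())
    # most_common() sorts by size, largest first (ties keep first-seen order).
    mapping = {old: new for new, (old, _) in enumerate(counts.most_common(), start=1)}
    mapping[outlier_label] = 0
    return np.array([mapping[label] for label in labels.tolist()])


print(relabel_by_size([2, 2, 2, 0, 0, -1, 1]))  # [1 1 1 2 2 0 3]
```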

Summary

To summarize, this research identified distinct, structurally conserved clusters within CDR groups in antibodies, providing insight into their structural diversity. By combining a clustering pipeline with manual inspection, the study contributes to a deeper understanding of the diversity present in CDRs. We hope these CDR clusters will support drug development and antibody research by helping to explain antibody mechanisms and their potential therapeutic applications.

In addition to the CDR clusters, we trained a machine-learning classifier to assign structures and sequences to one of the CDR clusters.

Cluster assignment for new sequences and structures

After creating the different clusters per CDR group, we developed a machine-learning classifier to assign new structures and sequences to these CDR clusters.

Structure assignment

Because the CDR clusters are defined on structures, a random forest classifier can assign a new structure to a CDR cluster. The random forest uses only the coordinates of the C-alpha atoms of the CDR and three additional neighbouring residues; no torsion angles are used. The random forest classifier performs nearly perfectly, as shown in the results below.
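A classifier of this kind can be sketched with scikit-learn. The example assumes the structures of one CDR group are already superposed and stored as an array of C-alpha coordinates. The hyperparameters, the 25% hold-out split and the function name are illustrative choices, not the settings used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def train_structure_classifier(ca_coords, cluster_labels, seed=0):
    """Fit a random forest on flattened C-alpha coordinates.

    `ca_coords` has shape (n_structures, n_residues, 3), where the residues
    are the CDR plus its neighbouring residues.
    Returns the fitted model and its accuracy on a held-out 25% split.
    """
    X = np.asarray(ca_coords, dtype=float).reshape(len(ca_coords), -1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, cluster_labels, test_size=0.25, stratify=cluster_labels, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```

Raw coordinates are only meaningful as features when every structure sits in the same reference frame, so a new structure has to be superposed in the same way as the training data before it is classified.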

Sequence assignment

Assigning a CDR to a cluster based on the sequence alone is a more difficult task. Previously, this task was accomplished using HMM profiles. When we investigated other techniques, another model was found to perform better: we trained a CatBoost model on amino acids encoded by their hydrophobicity and mass. The results are shown below.
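The encoding step can be illustrated as follows. The text does not say which hydrophobicity scale or which mass values were used, so this sketch assumes the Kyte-Doolittle hydropathy scale and average residue masses in daltons. The function name is hypothetical.

```python
import numpy as np

# Assumed scale: Kyte-Doolittle hydropathy index per amino acid.
HYDROPHOBICITY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Assumed values: average residue mass in daltons.
RESIDUE_MASS = {
    "A": 71.08, "R": 156.19, "N": 114.10, "D": 115.09, "C": 103.14, "Q": 128.13,
    "E": 129.12, "G": 57.05, "H": 137.14, "I": 113.16, "L": 113.16, "K": 128.17,
    "M": 131.19, "F": 147.18, "P": 97.12, "S": 87.08, "T": 101.10, "W": 186.21,
    "Y": 163.18, "V": 99.13,
}


def encode_sequence(cdr_sequence):
    """Return a flat vector [hydrophobicity_1, mass_1, hydrophobicity_2, mass_2, ...]."""
    pairs = [[HYDROPHOBICITY[aa], RESIDUE_MASS[aa]] for aa in cdr_sequence]
    return np.array(pairs).ravel()


print(encode_sequence("GFTFSSYA").shape)  # two features per residue
```

Since all CDRs in a group have the same length, every sequence in a group maps to a vector of the same size. A matrix of such vectors, together with the cluster labels, could then be passed to a gradient-boosting classifier such as `catboost.CatBoostClassifier`.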

Assignment to “outlier cluster”

In our CDR clustering, an “outlier cluster” was found in 9 out of 20 CDR groups. By this we mean that the CDR structures that don't fit in the well-defined CDR clusters are grouped together as the outlier cluster. It is important to note that structures in this outlier cluster are not structurally similar to each other.

For cluster assignment, the model is trained on all data, including this outlier cluster. In a CDR group containing such an outlier cluster, the model performs reasonably well at assigning a CDR structure that doesn’t fit in any of the real clusters to the outlier cluster. However, some CDR groups don't have an outlier cluster. In this case, a CDR structure will always be assigned to one of the real clusters and will never be labelled as an outlier.

Conclusion

In conclusion, this research successfully identified distinct CDR clusters within antibodies, providing valuable insights into their structural diversity and functional characteristics. Additionally, we developed a robust machine-learning classifier to assign new structures or sequences to CDR clusters. This was done by utilizing a random forest approach for structure assignment and a CatBoost model for sequence assignment. The identification of "outlier clusters" in certain CDR groups emphasized the presence of structurally divergent CDR structures. By training the model on data inclusive of these outliers, the research ensured effective cluster assignments, contributing to a comprehensive understanding of CDRs and their implications in antibody research and drug development.