Most Structurally Hostile Cross-Validation (MSH-CV)

Introduction

To test the generalisability of an in silico model (QSAR or a similar molecule-activity predictive model), the best option is to generate a new data set by running a new series of assays; it is also generally infeasible. A method that would allow a simulated series of molecules to serve as a stand-in for new data would be very useful here.

One subgoal of generalisability testing is to estimate the sensitivity of the model to changes of structure. For example, it is common for QSAR models to overfit through structural side-channels when one is not very careful about curation of the input dataset. A model with sufficient expressive power (e.g. a neural network) can identify chemical classes (which by and large will have similar activity profiles, at least more similar than two molecules taken at random) through indirect means such as a "number of acyclic, three-valent atoms"-type descriptor. This artificially increases its performance at the cost of generalisability to new chemical classes, and may not be obvious at first glance (do lead-optimised molecules have fewer acyclic three-valent atoms? I sure don't know, but that model is going to find out if so).
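To make that side-channel concrete, such a descriptor is trivial to compute with RDKit; the function and the example molecules below are purely illustrative, not part of any standard descriptor set.

from rdkit import Chem

def n_acyclic_three_valent_atoms(smiles):
    """Count heavy atoms with exactly three heavy-atom neighbours that sit outside any ring."""
    mol = Chem.MolFromSmiles(smiles)
    return sum(1 for atom in mol.GetAtoms()
               if atom.GetDegree() == 3 and not atom.IsInRing())

# A model can latch onto this kind of count as a proxy for chemical class
# instead of learning anything about activity.
print(n_acyclic_three_valent_atoms("CC(C)Cc1ccc(cc1)C(C)C(=O)O"))  # ibuprofen
print(n_acyclic_three_valent_atoms("c1ccccc1"))                    # benzene: 0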

This particular problem is what led me to this idea of Most Structurally Hostile Cross-Validation (MSH-CV).

Genesis of the protocol

A first idea I had was simulating this by looking at the training dataset, finding a chemical class with non-negligible size and sufficient diversity of activity, training on the rest of the data, then testing on this structurally unrelated class to assess the performance.

While this would go some way towards convincing me that the model was not overfitting too much (and thus is a good back-of-the-envelope test when exploring model architecture), it would not have much persuasive power over bystanders or reviewers: what if I had cherry-picked this particular chemical class? What if it is not actually that different from the rest of the training set (some other class may be more starkly different)? What if it works for this class, but not for any other?

This approach would need to be formalised so that:

  • The choice of the test chemical class is automated
  • Several classes can be tested
  • There are few to no tunable parameters ("nothing up my sleeve" methodology)
  • The test/train groups of molecules are maximally different (to the extent permitted by the original dataset)

Proposed procedure for MSH-CV

  1. Calculate molecular fingerprints
  2. Calculate Tanimoto distance matrix (1 - similarity)
  3. Cluster the molecules
  4. Isolate N clusters
  5. Cross-validate: train on N-1 clusters, test on the Nth cluster (the skeleton of this step is sketched just after the list)
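For step 5 specifically, the leave-one-cluster-out loop can also be wired through scikit-learn's LeaveOneGroupOut, using the cluster labels as groups. This is only a sketch: it assumes X, y and labels are already available as NumPy arrays, and uses the same L1-regularised logistic regression as the full code further down.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Each cluster becomes, in turn, the held-out test set
logo = LeaveOneGroupOut()
aucs = []
for train_idx, test_idx in logo.split(X, y, groups=labels):
    y_tr, y_te = y[train_idx], y[test_idx]
    # ROC AUC is undefined on single-class folds, so skip those
    if len(np.unique(y_tr)) < 2 or len(np.unique(y_te)) < 2:
        continue
    clf = LogisticRegression(penalty="l1", solver="saga", max_iter=10000)
    clf.fit(X[train_idx], y_tr)
    aucs.append(roc_auc_score(y_te, clf.predict_proba(X[test_idx])[:, 1]))

print("Hostile-CV AUC: {:.2f} +/- {:.2f}".format(np.mean(aucs), np.std(aucs)))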

Implementation note

Morgan fingerprints (ECFP) seem appropriate here, as they will cluster similar scaffolds (i.e., the same chemical class, loosely speaking) together. Moiety-based fingerprints seem less appropriate, but may be useful in some cases (which cases, I'm not sure).
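As a side note on step 2, RDKit's BulkTanimotoSimilarity fills one row of the distance matrix per call, which is usually much faster than a doubly nested Python loop. A sketch, assuming a list of RDKit molecules called mols (the radius of 2 is an arbitrary common choice; the full code below uses 4):

import numpy as np
from rdkit.Chem import AllChem
from rdkit import DataStructs

# Morgan bit vectors
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

# Tanimoto distance matrix, one whole row per call
dist_mat = np.zeros((len(fps), len(fps)))
for i, fp in enumerate(fps):
    sims = DataStructs.BulkTanimotoSimilarity(fp, fps)
    dist_mat[i, :] = 1.0 - np.asarray(sims)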

Hierarchical clustering seems appropriate here because it takes no parameters. If intractable, k-means will likely do, but it is a little more fiddly and therefore less convincing to the outside observer (though not by much). The complete-linkage method has worked well for me.
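If you prefer to look at the tree before committing to N clusters, SciPy's hierarchy module gives the same complete-linkage clustering as the scikit-learn call used in the code below; a sketch assuming the precomputed dist_mat from that code:

from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# linkage() expects a condensed (upper-triangle) distance vector, not a square matrix
condensed = squareform(dist_mat, checks=False)
Z = linkage(condensed, method="complete")
# scipy.cluster.hierarchy.dendrogram(Z) can be used to inspect the tree before choosing N

# Cut the tree into N flat clusters (labels run from 1 to N)
N = 5
labels = fcluster(Z, t=N, criterion="maxclust")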

N can be 5. Excluding the clusters that are either all active or all inactive allows one to calculate ROC AUC and MCC, which I like. It also seems to make the task more difficult for the model, but I do not have formal numbers on that.

Python code

Using RDKit and scikit-learn, here is an example of L1-regularized logistic regression under MSH-CV. You will have to modify this snippet anyway, so there is no guarantee of fitness for purpose.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import roc_curve, auc
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

import scipy.interpolate as interp
import numpy as np
import matplotlib.pyplot as plt


### Data you have to provide:

X_train_smiles = ...  # pandas Series with the SMILES of the training set, indexed like X_train
X_train = ...         # the inputs to the model
X_test = ...
y_train = ...         # the output of the model: the classes or activity or whatever
y_test = ...


# Scaler fit on the full training set. Note that the scaled arrays are not used
# by the hostile-CV loop below, which builds its folds from the raw X_train.
scaler_x = StandardScaler()
scaler_x.fit(X_train)

X_train_scaled = scaler_x.transform(X_train)
X_test_scaled = scaler_x.transform(X_test)

# Morgan (ECFP) fingerprints for every training molecule
morgan_fp = []

for index, row in X_train.iterrows():
    smiles = X_train_smiles.loc[index]
    m = Chem.MolFromSmiles(smiles)

    fp1 = AllChem.GetMorganFingerprintAsBitVect(m, 4)  # radius 4, 2048 bits by default
    morgan_fp.append(fp1)
    print("\r{:>5}".format(index), end='')

# Tanimoto distance matrix (1 - similarity)
dist_mat = np.zeros((len(morgan_fp), len(morgan_fp)))

for i in range(len(morgan_fp)):
    for j in range(len(morgan_fp)):
        dist = 1 - DataStructs.TanimotoSimilarity(morgan_fp[i], morgan_fp[j])
        dist_mat[i, j] = dist

foldHostileCV = 8
hclust = AgglomerativeClustering(metric="precomputed",  # X will be the distance matrix; older scikit-learn names this parameter "affinity"
                                 n_clusters=foldHostileCV,
                                 compute_full_tree=True,
                                 linkage="complete"
                                )
hclust.fit(dist_mat)  # populates hclust.labels_, used below to build the folds

label_filename = "Logit_L1REG"
label_plot = "Logistic regression ($l_1$ regularized)"

fpr_cv = []
tpr_cv = []
thresholds_cv = []
auc_cv = []

actually_done_folds = 0

for numClust in range(foldHostileCV):
    print("[Most hostile] fold {}".format(numClust), end='')
    # The current cluster is the test set; every other cluster forms the training set
    testFoldIdx = np.where(hclust.labels_ == numClust)[0]
    trainFoldsIdx = np.where(hclust.labels_ != numClust)[0]

    X_train_currentFold = X_train.iloc[trainFoldsIdx]
    X_test_currentFold = X_train.iloc[testFoldIdx]

    y_train_currentFold = y_train.iloc[trainFoldsIdx]
    y_test_currentFold = y_train.iloc[testFoldIdx]

    # ROC AUC is undefined on single-class folds, so skip them
    if (y_train_currentFold.sum() == 0) or (y_test_currentFold.sum() == 0):
        print(" -> skipping, no positive sample in fold\n", end="")
        continue

    if (y_train_currentFold.sum() == len(y_train_currentFold)) or (y_test_currentFold.sum() == len(y_test_currentFold)):
        print(" -> skipping, no negative sample in fold\n", end="")
        continue

    clf = linear_model.LogisticRegression(penalty='l1',  # L1 regularisation, as advertised
                         C=5.43,  # inverse regularisation strength; tune for your own data
                         random_state=42,
                         solver='saga',  # saga supports the l1 penalty
                         max_iter=10000)

    clf.fit(X_train_currentFold, y_train_currentFold)

    resProba_i = clf.predict_proba(X_test_currentFold)

    fpr, tpr, thresholds = roc_curve(y_test_currentFold, resProba_i[:, 1])
    roc_auc = auc(fpr, tpr)

    tpr_cv.append(tpr)
    fpr_cv.append(fpr)
    thresholds_cv.append(thresholds)
    auc_cv.append(roc_auc)

    actually_done_folds += 1
    print(" Done\n", end='')
    

# Stretch every fold's ROC curve to a common length so the curves can be averaged pointwise
maxlength = max([len(x) for x in fpr_cv])

fpr_interp = []
tpr_interp = []

for i in range(len(fpr_cv)):
    fpr_i = interp.interp1d(np.arange(fpr_cv[i].size), fpr_cv[i])
    fpr_interp.append(fpr_i(np.linspace(0, fpr_cv[i].size - 1, maxlength)))

    tpr_i = interp.interp1d(np.arange(tpr_cv[i].size), tpr_cv[i])
    tpr_interp.append(tpr_i(np.linspace(0, tpr_cv[i].size - 1, maxlength)))
    
plt.figure(figsize=(8, 5), dpi=300)

# Individual fold ROC curves
for i in range(len(fpr_cv)):
    plt.plot(np.array(fpr_cv[i]), np.array(tpr_cv[i]), alpha=0.2, c="red")

# Mean ROC curve and its one-standard-deviation band, computed once over all folds
auc_mean = np.mean(auc_cv)
auc_std = np.std(auc_cv)

mean_fpr = np.mean(fpr_interp, axis=0)
mean_tpr = np.mean(tpr_interp, axis=0)
std_tpr = np.std(tpr_interp, axis=0)

plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (auc_mean, auc_std),
         lw=2, alpha=.8)

tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label=r'$\pm$ 1 std. dev.')

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC: {}\nUnder {}-fold most-structurally-hostile cross-validation'.format(label_plot, actually_done_folds))
plt.legend(loc="lower right")
plt.savefig('HOSTILE_{}CV_ROC_{}.png'.format(foldHostileCV, label_filename), dpi=400)
plt.show()
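The loop above only records ROC AUC. Since MCC was mentioned earlier, a minimal addition at the end of each fold could look like the following; it reuses clf, X_test_currentFold and y_test_currentFold from the loop, and the implicit 0.5 probability threshold of predict() is my assumption, not something defined above.

from sklearn.metrics import matthews_corrcoef

# MCC needs hard class labels, not probabilities; predict() uses the default 0.5 cut-off
y_pred_currentFold = clf.predict(X_test_currentFold)
fold_mcc = matthews_corrcoef(y_test_currentFold, y_pred_currentFold)
print("  fold MCC: {:.3f}".format(fold_mcc))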

Perspectives

Is it actually good to have models that are entirely independent of structure? Probably not, so some drop in performance on this test is to be expected. But not too much of a drop.

MSH-CV is, of course, centred on structure, but other variants may exist: if you want to optimise performance on metric X but have Y as a possible side-channel, then cluster based on Y and apply the same kind of protocol.

Anyway that’s all for this post.