PURPOSE: Multivariate modeling of complications after radiotherapy is frequently used in conjunction with data-driven variable selection. This study uses data with typical clinical characteristics to quantify the risk of overfitting in a data-driven modeling method based on bootstrapping, and estimates the minimum amount of data needed to obtain models with relatively high predictive power.
MATERIALS AND METHODS: To facilitate repeated modeling and cross-validation with independent data sets for the assessment of true predictive power, a method was developed to generate simulated data with statistical properties similar to those of real clinical data sets. Characteristics of three clinical data sets from radiotherapy treatment of head and neck cancer patients were used to simulate data with set sizes between 50 and 1000 patients. A logistic regression method using bootstrapping and forward variable selection was used for complication modeling, yielding a selected number of variables and an estimated predictive power for each simulated data set. The true optimal number of variables and true predictive power were calculated using cross-validation with very large independent data sets.
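As an illustration of the kind of procedure described above, the following is a minimal sketch of forward variable selection for logistic regression in which the number of variables is chosen by bootstrapping. It is not the study's implementation. All function names are hypothetical, scoring each model size by its mean out-of-bag log-likelihood is an assumption about how the bootstrap might be used, the fit uses plain gradient ascent for brevity, and the toy data are independent Gaussian variables rather than data simulated from clinical characteristics.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, row, cols):
    # w[0] is the intercept; w[1:] pair up with the selected columns.
    return sigmoid(w[0] + sum(wj * row[c] for wj, c in zip(w[1:], cols)))

def fit_logistic(X, y, cols, steps=100, lr=0.5):
    # Plain gradient ascent on the mean log-likelihood (kept simple on purpose).
    w = [0.0] * (len(cols) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, yi in zip(X, y):
            err = yi - predict(w, row, cols)
            grad[0] += err
            for j, c in enumerate(cols, start=1):
                grad[j] += err * row[c]
        w = [wj + lr * g / len(y) for wj, g in zip(w, grad)]
    return w

def mean_loglik(w, X, y, cols):
    # Mean log-likelihood of the outcomes under the fitted model.
    total = 0.0
    for row, yi in zip(X, y):
        p = min(max(predict(w, row, cols), 1e-12), 1.0 - 1e-12)
        total += math.log(p if yi == 1 else 1.0 - p)
    return total / len(y)

def forward_path(X, y, max_vars):
    # Greedy forward selection: add the variable with the best in-sample fit.
    chosen, path = [], []
    remaining = list(range(len(X[0])))
    for _ in range(max_vars):
        best = max(remaining, key=lambda c: mean_loglik(
            fit_logistic(X, y, chosen + [c]), X, y, chosen + [c]))
        chosen = chosen + [best]
        remaining.remove(best)
        path.append(chosen)
    return path

def select_n_variables(X, y, max_vars=3, n_boot=10, seed=0):
    # Assumption: score each model size by its out-of-bag log-likelihood,
    # summed over bootstrap resamples, and keep the best-scoring size.
    rng = random.Random(seed)
    n = len(y)
    totals = [0.0] * max_vars
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        Xo, yo = [X[i] for i in oob], [y[i] for i in oob]
        for k, cols in enumerate(forward_path(Xb, yb, max_vars)):
            totals[k] += mean_loglik(fit_logistic(Xb, yb, cols), Xo, yo, cols)
    return 1 + max(range(max_vars), key=lambda k: totals[k])

# Toy data (hypothetical): 120 patients, 4 candidate variables, only the
# first of which truly influences the complication probability.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(120)]
y = [1 if rng.random() < sigmoid(-1.0 + 1.5 * row[0]) else 0 for row in X]

n_vars = select_n_variables(X, y)
final_vars = forward_path(X, y, n_vars)[-1]
print(n_vars, final_vars)
```

Scoring on out-of-bag patients rather than on the resample itself is what lets a procedure of this kind penalize models that carry too many variables. A production analysis would more likely use an established fitting routine than the hand-written gradient ascent shown here.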
RESULTS: For all simulated data set sizes the number of variables selected by the bootstrapping method was on average close to the true optimal number of variables, but showed considerable spread. Bootstrapping was more accurate in selecting the optimal number of variables than the AIC and BIC alternatives, but this did not translate into a significant difference in true predictive power. The true predictive power converged asymptotically toward a maximum predictive power for large data sets, and the estimated predictive power converged toward the true predictive power. More than half of the potential predictive power was gained with approximately 200 samples. Our simulations demonstrated severe overfitting (a predictive power lower than that of predicting 50% probability) in a number of small data sets, in particular in data sets with a low number of events (median: 7, 95th percentile: 32). Recognizing overfitting from an inverted sign of the estimated model coefficients had limited discriminative value.
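The severe-overfitting criterion above can be made concrete with a short sketch. The abstract does not name the measure of predictive power, so the mean log-likelihood on independent data is used here as an assumption. The function names and the example probabilities are hypothetical illustrations, not study data.

```python
import math

def mean_log_likelihood(probs, outcomes):
    # Mean log-likelihood of binary outcomes under predicted probabilities.
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(outcomes)

def is_severely_overfit(probs, outcomes):
    # Compare against the uninformative model that predicts 50% for everyone,
    # whose mean log-likelihood is log(0.5) regardless of the outcomes.
    return mean_log_likelihood(probs, outcomes) < math.log(0.5)

# Hypothetical independent validation outcomes (1 = complication).
outcomes = [1, 0, 0, 1, 0, 0, 0, 1]

# A model that is confidently wrong on new patients.
confident_wrong = [0.05, 0.9, 0.8, 0.1, 0.95, 0.7, 0.9, 0.2]
# A model whose predictions track the outcomes reasonably well.
reasonable = [0.7, 0.2, 0.3, 0.8, 0.1, 0.4, 0.2, 0.6]

print(is_severely_overfit(confident_wrong, outcomes))
print(is_severely_overfit(reasonable, outcomes))
```

Under this reading, a severely overfit model is worse than having no model at all: on new patients it performs below a constant 50% prediction.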
CONCLUSIONS: Despite considerable spread around the optimal number of selected variables, the bootstrapping method is efficient and accurate for sufficiently large data sets, and guards against overfitting in all simulated cases except some data sets with a particularly low number of events. An appropriate minimum data set size to obtain a model with high predictive power is approximately 200 patients and more than 32 events. With fewer data samples, the true predictive power decreases rapidly, and for larger data set sizes the benefit levels off toward an asymptotic maximum predictive power.