In clinical and biomedical research, study databases are often characterized by complex relationships between the collected measurements. For example, researchers may have to deal with repeated and/or clustered measurements, or may want to analyze non-standard outcome variables such as ratios, percentages, scores, or counts with a large number of zeros. Classical statistical modeling tools (like linear least squares regression) are often inadequate in these situations. For this reason, research in biostatistics is concerned with the development of novel techniques to describe and infer the relationships in "non-standard" clinical and biomedical data. Our group has developed, among other methods, a mixed-effects model for longitudinal measurements of atrophy size (in patients with age-related macular degeneration, Behning et al. 2021) and a modeling approach for outcomes that are given by the ratio of two correlated random variables (such as the amyloid-beta 42/40 ratio in dementia research, Berger et al. 2019).
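To illustrate one of these non-standard settings, the following sketch (a hypothetical simulation, not taken from any of the cited studies) shows why a plain Poisson model is inadequate for count outcomes with excess zeros: the observed fraction of zeros far exceeds what a Poisson distribution with the same mean would predict.

```python
import math
import random

random.seed(1)

def poisson(lam):
    """Draw from a Poisson distribution (Knuth's algorithm; the
    Python standard library has no built-in Poisson sampler)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

# Zero-inflated counts: with probability 0.4 a patient contributes a
# "structural" zero (e.g. never at risk), otherwise a Poisson(3) count.
n = 5000
counts = [0 if random.random() < 0.4 else poisson(3.0) for _ in range(n)]

mean = sum(counts) / n
obs_zero_frac = sum(c == 0 for c in counts) / n
pois_zero_frac = math.exp(-mean)  # zero probability under a Poisson with this mean

print(f"observed zeros: {obs_zero_frac:.2f}, Poisson-predicted: {pois_zero_frac:.2f}")
```

The observed zero fraction comes out more than twice as large as the fraction a single Poisson distribution with the sample mean would predict; this is the kind of discrepancy that zero-inflated or hurdle models are designed to capture.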
Many clinical and epidemiological studies have a longitudinal design, collecting data from patients during a pre-defined time period with regular follow-up visits. In these studies, researchers are often interested in a set of target events (such as death or disease progression) that might be experienced by the study participants at some time after the beginning of the study. The analysis of such "time-to-event" data is often challenging, as it is usually not possible to record all target events during the study period (for instance, because some participants might have left the study before having experienced any of the events). This phenomenon is called "censoring". Our group is interested in statistical methods for the analysis of censored data, with a focus on so-called "discrete" event times that arise from data collection at a fixed set of follow-up visits (Tutz & Schmid 2016). For example, we developed methods to model the incidence of nosocomial pneumonia in the presence of one or more "competing" target events (Berger et al. 2020). Furthermore, we are active in the analysis of performance measures for time-to-event models (like discrimination and calibration, Schmid et al. 2018, Berger & Schmid 2022).
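As a minimal illustration of discrete event times with censoring (a toy life-table calculation on made-up data, not the methods of the cited papers): the discrete hazard at visit t is the number of observed events at t divided by the number of patients still at risk, and the estimated survival function is the running product of one minus the hazards. Censored patients contribute to the risk sets up to their dropout visit but never count as events.

```python
# Toy follow-up data: (visit of event or dropout, event observed?).
# observed=False means the patient was censored at that visit,
# i.e. left the study without having experienced the target event.
records = [(1, True), (2, True), (2, False), (3, True), (3, False),
           (4, True), (4, True), (5, False), (5, False), (5, False)]

survival = {}
s = 1.0
for t in range(1, max(ti for ti, _ in records) + 1):
    at_risk = sum(1 for ti, _ in records if ti >= t)       # still under observation
    events = sum(1 for ti, ev in records if ti == t and ev)
    hazard = events / at_risk                              # discrete hazard at visit t
    s *= 1.0 - hazard
    survival[t] = s

for t, s_t in survival.items():
    print(f"visit {t}: estimated survival {s_t:.3f}")
```

Note that the estimate stays constant at visit 5, where only censored observations occur: censoring removes patients from the risk set without counting as an event.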
In non-randomized and/or explorative studies, researchers often have to deal with large numbers of variables. When the aim is to build a statistical model from these variables (e.g. for diagnosis or prognosis), it is usually not possible to include all of them in the model. For this reason, variable selection is a key issue in statistical model building (Sauerbrei et al. 2020, Bommert et al. 2022). During the past years, statistical learning techniques (also termed machine learning techniques) have been successful in dealing with large numbers of variables, yielding improved predictions even when sample sizes are comparatively small. On the other hand, the improved prediction accuracy often comes at the price of limited interpretability, making it hard to infer the characteristics of the predictor-response relationships (so-called black-box models). As the explanation of the predictor-response relationships is usually important to our clinical collaborators, we are interested in the development of statistical learning methods that bridge the gap between prediction accuracy and interpretability. For example, we are active in the improvement of gradient boosting algorithms, which can be modified such that they produce model fits having the same structure as standard fits obtained from linear or logistic regression. Furthermore, we are interested in the development and application of explainable AI methods, which can be used to increase the interpretability of black-box methods (like random forests or tree boosting) by post-processing the respective predictions (Welchowski et al. 2022, Maloney et al. 2022).
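The idea of boosting with interpretable, regression-style output can be sketched as componentwise L2 boosting (a generic textbook variant, not our group's specific implementation; all data and tuning parameters below are made up): each iteration fits every candidate predictor to the current residuals, then updates only the best-fitting coefficient by a small step. The final fit is an ordinary linear model in which unselected variables keep a coefficient of exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]       # only 3 of the 20 predictors matter
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta = np.zeros(p)                     # boosted coefficients, start at zero
nu = 0.1                               # learning rate (step size)
for _ in range(300):
    resid = y - X @ beta
    # Least-squares coefficient of each predictor fitted alone to the residuals
    denom = (X ** 2).sum(axis=0)
    coefs = X.T @ resid / denom
    # Update only the predictor that reduces the residual sum of squares most
    j = np.argmax(coefs ** 2 * denom)
    beta[j] += nu * coefs[j]

selected = np.flatnonzero(np.abs(beta) > 0.05)
print("selected predictors:", selected)
```

Because updating stops after a fixed number of iterations, coefficients of uninformative predictors remain at or near zero: early stopping performs variable selection, and the fitted model reads exactly like a linear regression fit.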
Institute for Medical Biometry,
Informatics and Epidemiology (IMBIE)
University of Bonn