Prof. Raja Awarded NSF EAGER Grant
August 26, 2014
Professor Anita Raja was awared National Science Foundation (NSF) grant for EAGER: Collaborative Research: Advanced Machine Learning for Prediction of Preterm Birth.
The United States spends over 26 billion dollars per annum on the delivery and care of the 12-13% of infants who are born preterm. As preterm birth (PTB) is a major public health problem with profound implications on society, there would be extreme value in being able to identify women at risk of preterm birth during the course of their pregnancy. Previous predictive approaches have been largely unsuccessful since they have focused on a limited number of well described risk factors known to be correlated with preterm birth (e.g. prior preterm birth, race, and infection) and less on combining multiple factors. The latter approach is necessary to understand the complex etiologies of preterm birth. While identifying individual PTB risk factors has brought insight into the problem and has led in some cases to successful treatments such as progesterone for women with a previous preterm birth, this has only a limited impact on the overall frequency since many at risk patients, such as first time mothers (nulliparous), go untreated. Today, there is still no widely tested prediction system that combines well-known PTB factors and is clinically useful. There is, however, a global awareness of the need to discover and integrate the complex etiologies of prematurity in order to predict women at risk. Significant efforts have been made in the last couple of decades to collect large curated datasets of pregnant women. Previous studies on these datasets used relatively straightforward biostatitistical methodologies such as relative risk assessments to measure associations between factors and PTB. However, risk factors are studied independently of each other, which does not account for the multifactorial complexity of PTB. This exploratory project aims to investigate the value of more advanced machine learning methods by simultaneously considering all the factors, to develop better predictive methods.
The PTB data acquired in the context of this project brings together Electronic Health Records (EHRs) for mothers and their babies along with well-curated NIH data. The data is rich with structured clinical data and unstructured free text that require manual feature extraction. This project, largely motivated by the PTB problem, has two main goals:
(1) Improving the quality and aggregation the annotations for heterogeneous data. The researchers aim to capture socioeconomic, psychological and behavioral risk factors documented in the text of clinical notes via studying the process of manual feature extraction by human annotators. State-of-the-art methods either rely on the expertise of the annotator and/or the difficulty of the instance but ignore the variability in the quality of labeling over time due to fatigue, boredom, or knowledge. To improve the annotations, the project will develop a novel Bayesian framework for human labeling of unstructured data. The Bayesian model will embed a complete set of parameters including the prevalence of each class, difficulty of the instance and variability in the quality of annotation during the process. If the model construction is successful, then the developed framework will replace ad-hoc heuristics into a well-designed process for producing high quality annotations. This framework would allow extracting reliable features from the clinical text for subsequent analyses in devising PTB prediction models.
(2) Developing predictive models for multiple data spaces. To leverage all of the existing data, the project will investigate the value of using Vapnik's paradigm of Learning Using Privileged Information (LUPI) in the context of preterm birth. Privileged information is a data that is available for training models but is not available for test examples. Data in this project come with two potential privileged information spaces namely the clinical notes and the space of future events. NICU data is an example of future event privileged information, which is only available for a subset of the examples (only premature babies requiring intensive care stay in the NICU). It has been shown that LUPI not only induces a better decision rule, it also increases the rate of convergence of the algorithm, hence requiring fewer training examples. This is a compelling property in the case of PTB prediction because of the rate of PTB. The project will extend LUPI into a powerful and applicable framework to handle the two spaces of privileged information, while developing spline-generating kernels, to manage LUPI's high computational cost. If successful, this proof-of-concept is expected to yield efficient and widely applicable LUPI algorithms in domains where privileged information is available, such as the financial domain and many other medical applications.
The developed software, publications and datasets resulting from this project will be made publicly available to the research community through the project website.