Multi-View Representation Learning to Schema Match Databases

Abstract

An applied problem facing all areas of data science is harmonizing data sources. Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing and testing robust, generalizable algorithms, especially in health care. This joining is usually resolved using meta-data, which may be unavailable or ambiguous in a large database. We design and evaluate methods for mapping features between databases independent of meta-data. We evaluate methods in the challenging case of numeric features without reliable and distinctive univariate summaries, such as nearly Gaussian and binary features. We assume that a small set of features is mapped a priori between two databases, which share an unknown number of identical features and possibly many unrelated features. We compare the performance of contrastive learning methods for feature representations, novel partial auto-encoders, mutual-information graph optimizers, and simple statistical baselines on simulated data, public datasets, the MIMIC-III medical-record changeover, and perioperative records from before and after a medical-record system change. Performance was evaluated using both the mapping of identical features and the reconstruction accuracy of examples in the format of the other database. Contrastive learning-based methods performed best overall, often substantially beating the literature baseline in matching and reconstruction, especially in the more challenging real-data experiments. The partial auto-encoder (chimeric) method matched features on par with contrastive methods in all synthetic and some real datasets, and also reconstructed well. The statistical method we created nevertheless performed reasonably well in many cases, with much less dependence on hyperparameter tuning. We also validated the matches in the EHR datasets and found that some apparent mistakes were in fact surrogate or related features, as judged by two subject matter experts.
In simulation studies and several real-world examples, we find that summaries of inter-feature relationships are effective at identifying matching or closely related features across databases when meta-data is not available. Decoder architectures are also reasonably effective at imputing features without an exact match.
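As a rough illustration of what a "summary of inter-feature relationships" can look like, the sketch below describes each unmapped feature by its correlations with the features that are already mapped, then pairs features across the two databases by nearest fingerprint. This is not the method evaluated in the paper; the function names, the simulated loadings, and the nearest-neighbor matching rule are all assumptions made for the example.

```python
import numpy as np

def fingerprints(known, unmapped):
    """Correlate every unmapped column with every known (pre-mapped) column."""
    k = known.shape[1]
    corr = np.corrcoef(np.hstack([known, unmapped]), rowvar=False)
    return corr[k:, :k]  # shape: (n_unmapped, n_known)

def match_features(known_a, unmapped_a, known_b, unmapped_b):
    """For each unmapped column of A, pick the B column with the closest fingerprint."""
    fa = fingerprints(known_a, unmapped_a)
    fb = fingerprints(known_b, unmapped_b)
    # Pairwise Euclidean distance between fingerprints of A and B features.
    dist = np.linalg.norm(fa[:, None, :] - fb[None, :, :], axis=2)
    return dist.argmin(axis=1), dist

def simulate(n_rows, rng):
    """Hypothetical data: 4 known features and 6 unmapped ones driven by 4 latents."""
    loadings = np.array([
        [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
        [0, 0, 0, 1], [1, 1, 0, 0], [1, -1, 0, 0],
    ], dtype=float)
    z = rng.standard_normal((n_rows, 4))
    known = z + 0.3 * rng.standard_normal((n_rows, 4))
    unmapped = z @ loadings.T + 0.5 * rng.standard_normal((n_rows, 6))
    return known, unmapped

rng = np.random.default_rng(0)
# Two databases with different rows (no shared records), same underlying features.
known_a, unmapped_a = simulate(2000, rng)
known_b, unmapped_b = simulate(2000, rng)

# Shuffle B's unmapped columns so their order carries no information.
perm = rng.permutation(unmapped_b.shape[1])
unmapped_b = unmapped_b[:, perm]
true_match = np.argsort(perm)  # B column that holds A's feature i

predicted, dist = match_features(known_a, unmapped_a, known_b, unmapped_b)
accuracy = float((predicted == true_match).mean())
print("matching accuracy:", accuracy)
```

The nearest-neighbor step does not enforce a one-to-one mapping and has no way to flag features that have no counterpart. A one-to-one assignment solver such as `scipy.optimize.linear_sum_assignment` applied to `dist`, plus a distance threshold for rejecting poor matches, would be natural extensions under the same assumptions.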

Sandhya Tripathi
Postdoctoral Research Associate

My research interests include clinical prediction models, fairness in AI models, database matching, and learning in the presence of label noise.