TY - JOUR
T1 - MOLGENIS/connect: a system for semi-automatic integration of heterogeneous phenotype data with applications in biobanks
AU - Pang, Chao
AU - van Enckevort, David
AU - de Haan, Mark
AU - Kelpin, Fleur
AU - Jetten, Jonathan
AU - Hendriksen, Dennis
AU - de Boer, Tommy
AU - Charbon, Bart
AU - Winder, Erwin
AU - van der Velde, K. Joeri
AU - Doiron, Dany
AU - Fortier, Isabel
AU - Hillege, Hans
AU - Swertz, Morris A.
PY - 2016/7/15
Y1 - 2016/7/15
AB - Motivation: While the size and number of biobanks, patient registries and other data collections are increasing, biomedical researchers still often need to pool data for statistical power, a task that requires time-intensive retrospective integration. Results: To address this challenge, we developed MOLGENIS/connect, a semi-automatic system to find, match and pool data from different sources. The system shortlists relevant source attributes from thousands of candidates using ontology-based query expansion to overcome variations in terminology. Then it generates algorithms that transform source attributes to a common target DataSchema. These include unit conversion, categorical value matching and complex conversion patterns (e.g. calculation of BMI). In comparison to human-experts, MOLGENIS/connect was able to auto-generate 27% of the algorithms perfectly, with an additional 46% needing only minor editing, representing a reduction in the human effort and expertise needed to pool data. Availability and Implementation: Source code, binaries and documentation are available as open-source under LGPLv3 from http://github.com/molgenis/molgenis and www.molgenis.org/connect.
U2 - 10.1093/bioinformatics/btw155
DO - 10.1093/bioinformatics/btw155
M3 - Article
VL - 32
SP - 2176
EP - 2183
JO - Bioinformatics
JF - Bioinformatics
SN - 1367-4803
IS - 14
ER -