MOTIVATION: Identifying sample mix-ups in biobanks is essential to allow the repurposing of genetic data for clinical pharmacogenetics, because pharmacogenetic advice based on the genetic information of another individual is potentially harmful. Existing methods for identifying mix-ups are limited to datasets in which additional omics data (e.g., gene expression) are available. Cohorts lacking such data can rely only on sex, which can reveal only half of the mix-ups. Here, we describe Idéfix, a method for identifying accidental sample mix-ups in biobanks using polygenic scores.
RESULTS: In the Lifelines population-based biobank, we calculated polygenic scores (PGSs) for 25 traits for 32,786 participants. Idéfix then compares the actual phenotypes to the PGSs and exploits the greater discordance between the two that is expected for mix-ups than for correctly assigned samples. In a simulation with induced mix-ups, Idéfix reaches an AUC of 0.90 using 25 PGSs and sex, a substantial improvement over using sex alone, which yields an AUC of 0.75. Subsequent simulations demonstrate Idéfix's potential in varying datasets with more powerful PGSs, suggesting that its performance will likely improve as more highly powered GWASs for commonly measured traits become available. Idéfix can be used to identify a set of high-quality participants who are very unlikely to reflect sample mix-ups and whose genetic data can therefore be used for clinical purposes, such as pharmacogenetic profiles. For instance, in Lifelines we can select 34.4% of participants, reducing the sample mix-up rate from 0.15% to 0.01%.
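The idea of scoring phenotype-PGS discordance and evaluating it on induced mix-ups can be illustrated with a small, purely synthetic simulation. The sketch below is not the Idéfix implementation: the function names, the assumed per-trait PGS-phenotype correlation `r`, the sample sizes, and the sum-of-squared-residuals score are all hypothetical choices made for illustration, and sex is not modelled.

```python
import math
import random


def simulate(n=2000, n_traits=25, r=0.3, mix_rate=0.1, seed=0):
    """Simulate standardized PGSs and phenotypes, then induce mix-ups.

    Assumption: every trait has the same PGS-phenotype correlation r.
    """
    rng = random.Random(seed)
    pgs = [[rng.gauss(0, 1) for _ in range(n_traits)] for _ in range(n)]
    noise_sd = math.sqrt(1 - r * r)
    pheno = [[r * g + noise_sd * rng.gauss(0, 1) for g in row] for row in pgs]

    # Induce mix-ups: rotate the genetic data within a random subset,
    # so each selected participant receives someone else's PGSs.
    idx = rng.sample(range(n), int(n * mix_rate))
    rotated = idx[1:] + idx[:1]
    observed_pgs = list(pgs)
    is_mixup = [False] * n
    for a, b in zip(idx, rotated):
        observed_pgs[a] = pgs[b]
        is_mixup[a] = True
    return observed_pgs, pheno, is_mixup


def discordance(pgs_row, pheno_row, r):
    """Sum of squared residuals of phenotype on PGS, scaled by residual variance."""
    return sum((y - r * g) ** 2 for g, y in zip(pgs_row, pheno_row)) / (1 - r * r)


def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC: P(mix-up scores higher than a correct sample)."""
    ranked = sorted(zip(scores, labels))
    rank_sum = sum(i + 1 for i, (_, lab) in enumerate(ranked) if lab)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Calling `simulate()`, scoring each participant with `discordance(g, y, 0.3)`, and passing the scores with the mix-up labels to `auc` gives a discrimination estimate for this toy setting. Raising `r` mimics more powerful PGSs and should increase the AUC. The values obtained this way are properties of the toy simulation and are not comparable to the figures reported for Lifelines.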
AVAILABILITY: Idéfix is freely available at https://github.com/molgenis/systemsgenetics/wiki/Idefix.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.