TY - JOUR
T1 - Omics-informed CNV calls reduce false-positive rates and improve power for CNV-trait associations
AU - Estonian Biobank Research Team
AU - Lepamets, Maarja
AU - Auwerx, Chiara
AU - Nõukas, Margit
AU - Claringbould, Annique
AU - Porcu, Eleonora
AU - Kals, Mart
AU - Jürgenson, Tuuli
AU - Morris, Andrew Paul
AU - Võsa, Urmo
AU - Bochud, Murielle
AU - Stringhini, Silvia
AU - Wijmenga, Cisca
AU - Franke, Lude
AU - Peterson, Hedi
AU - Vilo, Jaak
AU - Lepik, Kaido
AU - Mägi, Reedik
AU - Kutalik, Zoltán
N1 - Funding Information:
We thank participants of EstBB, UKB, SkiPOGH, and LLDeep for their data provision. The scholarships for PhD student mobility of M.L. and K.L. to work with SkiPOGH (University of Lausanne, Switzerland) and LLDeep (University of Groningen, the Netherlands) data and funding for EstBB data acquisition were provided by the European Regional Development Fund (project nos. 2014-2020.4.01.16-0125 and 2014-2020.4.01.15-0012 and EXCITE). This project has benefited from the funding of European Union's Horizon 2020 Research and Innovation Programme under grant agreement nos. 101016775 and 633666 and of the Estonian Research Council grant PUT (PRG687, PRG555, PRG1095, and PUT JD817). Z.K. was supported by the Department of Computational Biology at the University of Lausanne. M.N. was supported by Jacobs Foundation Research Fellowship (grant 2016 1217 09, to Dr. Katrin Männik, Health 2030 Genome Center, Geneva, Switzerland). The SKIPOGH study was supported by grants from the Swiss National Science Foundation (33CM30-124087 and 33CM30-140331). EstBB and part of the UKB computations were carried out on the High-Performance Computing (HPC) Center, University of Tartu. UKB association study was carried out on the JURA server, University of Lausanne. SkiPOGH computations were carried out on the HPC1 computational server of the University Hospital of Lausanne. We thank UMCG Genomics Coordination Center, the UG Center for Information Technology, and their sponsors BBMRI-NL and TarGet for storage and compute infrastructure for LLDeep data. Finally, we thank Natàlia Pujol Gualdo and Triin Laisk for critical reading of the manuscript. The authors declare no competing interests.
Publisher Copyright:
© 2022 The Author(s)
PY - 2022/10/13
Y1 - 2022/10/13
AB - Copy-number variations (CNV) are believed to play an important role in a wide range of complex traits, but discovering such associations remains challenging. While whole-genome sequencing (WGS) is the gold-standard approach for CNV detection, there are several orders of magnitude more samples with available genotyping microarray data. Such array data can be exploited for CNV detection using dedicated software (e.g., PennCNV); however, these calls suffer from elevated false-positive and -negative rates. In this study, we developed a CNV quality score that weights PennCNV calls (pCNVs) based on their likelihood of being true positive. First, we established a measure of pCNV reliability by leveraging evidence from multiple omics data (WGS, transcriptomics, and methylomics) obtained from the same samples. Next, we built a predictor of omics-confirmed pCNVs, termed omics-informed quality score (OQS), using only PennCNV software output parameters. Promisingly, OQS assigned to pCNVs detected in close family members was up to 35% higher than the OQS of pCNVs not carried by other relatives (p < 3.0 × 10^−90), outperforming other scores. Finally, in an association study of four anthropometric traits in 89,516 Estonian Biobank samples, the use of OQS led to a relative increase in the trait variance explained by CNVs of up to 56% compared with published quality filtering methods or scores. Overall, we put forward a flexible framework to improve any CNV detection method leveraging multi-omics evidence, applied it to improve PennCNV calls, and demonstrated its utility by improving the statistical power for downstream association analyses.
KW - anthropometric traits
KW - copy-number variation
KW - gene expression
KW - methylation
KW - multi-omics
KW - PennCNV
KW - structural variation
KW - whole genome sequencing
U2 - 10.1016/j.xhgg.2022.100133
DO - 10.1016/j.xhgg.2022.100133
M3 - Article
AN - SCOPUS:85135774703
SN - 2666-2477
VL - 3
JO - Human Genetics and Genomics Advances
JF - Human Genetics and Genomics Advances
IS - 4
M1 - 100133
ER -