Mukherjee S, Cogan JD, Newman JH, Phillips JA, Hamid R, Meiler J, Capra JA. Identifying digenic disease genes via machine learning in the Undiagnosed Diseases Network. Am J Hum Genet 2021; 108:1946-1963. [PMID: 34529933 PMCID: PMC8546038 DOI: 10.1016/j.ajhg.2021.08.010] [Received: 12/23/2020] [Accepted: 08/25/2021] [Indexed: 12/20/2022]
Abstract
Rare diseases affect millions of people worldwide, and discovering their genetic causes is challenging. More than half of the individuals analyzed by the Undiagnosed Diseases Network (UDN) remain undiagnosed. The central hypothesis of this work is that many of these rare genetic disorders are caused by multiple variants in more than one gene. However, given the large number of variants in each individual genome, experimentally evaluating combinations of variants for potential to cause disease is currently infeasible. To address this challenge, we developed the digenic predictor (DiGePred), a random forest classifier for identifying candidate digenic disease gene pairs by features derived from biological networks, genomics, evolutionary history, and functional annotations. We trained the DiGePred classifier by using DIDA, the largest available database of known digenic-disease-causing gene pairs, and several sets of non-digenic gene pairs, including variant pairs derived from unaffected relatives of UDN individuals. DiGePred achieved high precision and recall in cross-validation and on a held-out test set (PR area under the curve > 77%), and we further demonstrate its utility by using digenic pairs from the recent literature. In contrast to other approaches, DiGePred also appropriately controls the number of false positives when applied in realistic clinical settings. Finally, to enable the rapid screening of variant gene pairs for digenic disease potential, we freely provide the predictions of DiGePred on all human gene pairs. Our work enables the discovery of genetic causes for rare non-monogenic diseases by providing a means to rapidly evaluate variant gene pairs for the potential to cause digenic disease.
Affiliation(s)
- Souhrid Mukherjee
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN 37235, USA
- Joy D Cogan
  - Department of Pediatrics, Division of Medical Genetics and Genomic Medicine, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
- John H Newman
  - Pulmonary Hypertension Center, Division of Allergy, Pulmonary, and Critical Care Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- John A Phillips
  - Department of Pediatrics, Division of Medical Genetics and Genomic Medicine, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
- Rizwan Hamid
  - Department of Pediatrics, Division of Medical Genetics and Genomic Medicine, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
- Jens Meiler
  - Department of Chemistry, Vanderbilt University, Nashville, TN 37235, USA; Department of Pharmacology, Vanderbilt University, Nashville, TN 37235, USA; Center for Structural Biology, Vanderbilt University, Nashville, TN 37235, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37232, USA; Institute for Drug Discovery, Leipzig University Medical School, Leipzig 04103, Germany; Department of Chemistry, Leipzig University, Leipzig 04109, Germany; Department of Computer Science, Leipzig University, Leipzig 04109, Germany
- John A Capra
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN 37235, USA; Center for Structural Biology, Vanderbilt University, Nashville, TN 37235, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37232, USA; Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232, USA; Bakar Computational Health Sciences Institute and Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA 94143, USA