1
Feuermann M, Gaudet P. Interpreting Gene Ontology Annotations Derived from Sequence Homology Methods. Methods Mol Biol 2024; 2836:285-298. [PMID: 38995546 DOI: 10.1007/978-1-0716-4007-4_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2024]
Abstract
The Gene Ontology (GO) project describes the functions of the gene products of organisms from all kingdoms of life in a standardized way, enabling powerful analyses of experiments involving genome-wide analysis. The scientific literature is used to convert experimental results into GO annotations that systematically classify gene products' functions. However, to address the fact that only a minor fraction of all genes has been characterized experimentally, multiple predictive methods to assign GO annotations have been developed since the inception of GO. Sequence homologies between novel genes and genes with known functions help to approximate the roles of these non-characterized genes. Here we describe the main sequence homology methods to produce annotations: pairwise comparison (BLAST), protein profile models (InterPro), and phylogenetic-based annotation (PAINT). Some of these methods can be implemented with genome analysis pipelines (BLAST and InterPro2GO), while PAINT is curated by the GO consortium.
Affiliation(s)
- Marc Feuermann
  - SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Pascale Gaudet
  - SIB Swiss Institute of Bioinformatics, Geneva, Switzerland.
2
Li W, Shih A, Freudenberg-Hua Y, Fury W, Yang Y. Beyond standard pipeline and p < 0.05 in pathway enrichment analyses. Comput Biol Chem 2021; 92:107455. [PMID: 33774420 PMCID: PMC9179938 DOI: 10.1016/j.compbiolchem.2021.107455] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 10/30/2020] [Revised: 12/18/2020] [Accepted: 02/07/2021] [Indexed: 10/22/2022]
Abstract
A standard pathway/gene-set enrichment analysis, the over-representation analysis, is based on four values: the sizes of the two gene-sets, the size of their overlap, and the size of the gene universe from which the gene-sets are chosen. The standard result of such an analysis is based on the p-value of a statistical test. We supplement this standard pipeline with six cautions: (1) any p-value threshold used to distinguish enriched gene-sets from non-enriched ones is to a certain degree arbitrary; (2) genes in a gene-set may be correlated, which potentially overcounts the gene-set size; (3) any attempt to impose multiple testing correction will increase the false negative rate; (4) gene-sets in a gene-set database may be correlated, potentially overcounting the factor for multiple testing correction; (5) the discrete nature of the data makes it possible that a minimal change in counts may lead to a quantum change in the p-value threshold-based conclusion; (6) the two gene-sets may not be chosen from the universe of all human genes, but in fact from a subset of that universe, or even from two different subsets of all genes. Careful reconsideration of these issues can have an impact on an enrichment analysis conclusion. Part of our cautions mirrors the call from statisticians that reaching a conclusion from data is not a simple matter of a p-value smaller than 0.05, but a thoughtful process with due diligence.
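The four-value test summarized in this abstract is commonly computed as an upper-tail hypergeometric probability. The following is a minimal sketch under that assumption; the function name `ora_p_value` and its parameter names are illustrative and do not come from the paper.

```python
from math import comb


def ora_p_value(universe: int, set_a: int, set_b: int, overlap: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= overlap).

    universe: number of genes the two sets are drawn from (assumed shared)
    set_a, set_b: sizes of the two gene-sets
    overlap: observed number of genes common to both sets
    """
    max_overlap = min(set_a, set_b)
    total = comb(universe, set_b)  # all ways to draw set_b genes
    tail = sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # One extra shared gene moves the result across the 0.05 threshold.
    print(ora_p_value(20, 5, 5, 3))
    print(ora_p_value(20, 5, 5, 4))
```

With a universe of 20 genes and two sets of 5, an overlap of 3 gives p ≈ 0.073 while an overlap of 4 gives p ≈ 0.0049, which illustrates caution (5) about discrete counts. Changing `universe` while holding the other three values fixed is a simple way to probe caution (6).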
Affiliation(s)
- Wentian Li
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA
- Andrew Shih
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA
- Yun Freudenberg-Hua
  - Litwin-Zucker Center for the study of Alzheimer's Disease, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA; Division of Geriatric Psychiatry, Zucker Hillside Hospital, Northwell Health, Glen Oaks, NY, USA
- Wen Fury
  - Regeneron Pharmaceutical Inc., Tarrytown, NY, USA
- Yaning Yang
  - Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui, China
3
Malatras A, Duguez S, Duddy W. Muscle Gene Sets: a versatile methodological aid to functional genomics in the neuromuscular field. Skelet Muscle 2019; 9:10. [PMID: 31053169 PMCID: PMC6498474 DOI: 10.1186/s13395-019-0196-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 03/06/2019] [Accepted: 04/09/2019] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND: The approach of building large collections of gene sets and then systematically testing hypotheses across these collections is a powerful tool in functional genomics, both in the pathway analysis of omics data and in uncovering the polygenic effects associated with complex diseases in genome-wide association studies. The Molecular Signatures Database includes collections of oncogenic and immunologic signatures, enabling researchers to compare transcriptional datasets across hundreds of previous studies and leading to important insights in these fields, but such a resource does not currently exist for neuromuscular research. In previous work, we have shown the utility of gene set approaches for understanding muscle cell physiology and pathology.
METHODS: Following a systematic survey of public muscle data, we passed gene expression profiles from 4305 samples through a robust pre-processing and standardized data analysis pipeline. Two hundred eighty-two samples were discarded based on a battery of rigorous global quality controls. From among the remaining studies, 578 comparisons of interest were identified by a combination of text mining and manual curation of the study meta-data. For each comparison, significantly dysregulated genes (FDR adjusted p < 0.05) were identified.
RESULTS: Lists of dysregulated genes were divided between upregulated and downregulated to give 1156 Muscle Gene Sets (MGS). This resource is available for download (www.sys-myo.com/muscle_gene_sets) and is accessible through three commonly used functional genomics platforms (GSEA, EnrichR, and WebGestalt). Basic guidance and recommendations are provided for the use of MGS through these platforms. In addition, consensus muscle gene sets were created to capture the overlap between the results of similar studies, and analysis of these highlighted the potential for novel disease-relevant findings.
CONCLUSIONS: The MGS resource can be used to investigate the behaviour of any list of genes across previous comparisons of muscle conditions, to compare previous studies to one another, and to explore the functional relationship of muscle dysregulation to the Gene Ontology. Its major intended use is in enrichment testing for functional genomics analysis.
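The abstract selects genes at "FDR adjusted p < 0.05" without naming the adjustment procedure. The sketch below assumes the widely used Benjamini-Hochberg step-up method purely for illustration; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-gene p-values
    adj = benjamini_hochberg(raw)
    significant = [i for i, q in enumerate(adj) if q < 0.05]
    print(adj, significant)
```

Genes passing the threshold would then be split by the sign of their fold change, which is how 578 comparisons yield 1156 gene sets (one upregulated and one downregulated list per comparison).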
Affiliation(s)
- Apostolos Malatras
  - Myologie Centre de Recherche, Université Sorbonne, UMRS 974 UPMC, INSERM, FRE 3617 CNRS, AIM, Paris, France
  - Northern Ireland Centre for Stratified Medicine, Biomedical Sciences Research Institute, C-TRIC, Ulster University, Altnagelvin Hospital Campus, Glenshane Road, Derry/Londonderry, BT47 6SB UK
  - Department of Biological Sciences, Molecular Medicine Research Center, University of Cyprus, 1 University Avenue, 2109 Nicosia, Cyprus
- Stephanie Duguez
  - Myologie Centre de Recherche, Université Sorbonne, UMRS 974 UPMC, INSERM, FRE 3617 CNRS, AIM, Paris, France
  - Northern Ireland Centre for Stratified Medicine, Biomedical Sciences Research Institute, C-TRIC, Ulster University, Altnagelvin Hospital Campus, Glenshane Road, Derry/Londonderry, BT47 6SB UK
- William Duddy
  - Myologie Centre de Recherche, Université Sorbonne, UMRS 974 UPMC, INSERM, FRE 3617 CNRS, AIM, Paris, France
  - Northern Ireland Centre for Stratified Medicine, Biomedical Sciences Research Institute, C-TRIC, Ulster University, Altnagelvin Hospital Campus, Glenshane Road, Derry/Londonderry, BT47 6SB UK
4
Gershon ES, Pearlson G, Keshavan MS, Tamminga C, Clementz B, Buckley PF, Alliey-Rodriguez N, Liu C, Sweeney JA, Keedy S, Meda SA, Tandon N, Shafee R, Bishop JR, Ivleva EI. Genetic analysis of deep phenotyping projects in common disorders. Schizophr Res 2018; 195:51-57. [PMID: 29056493 PMCID: PMC5910299 DOI: 10.1016/j.schres.2017.09.031] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 07/02/2017] [Revised: 09/19/2017] [Accepted: 09/22/2017] [Indexed: 11/19/2022]
Abstract
Several studies of complex psychotic disorders with large numbers of neurobiological phenotypes are currently under way, in living patients and controls, and on assemblies of brain specimens. Genetic analyses of such data typically present challenges, because of the choice of underlying hypotheses on genetic architecture of the studied disorders and phenotypes, large numbers of phenotypes, the appropriate multiple testing corrections, limited numbers of subjects, imputations required on missing phenotypes and genotypes, and the cross-disciplinary nature of the phenotype measures. Advances in genotype and phenotype imputation, and in genome-wide association (GWAS) methods, are useful in dealing with these challenges. As compared with the more traditional single-trait analyses, deep phenotyping with simultaneous genome-wide analyses serves as a discovery tool for previously unsuspected relationships of phenotypic traits with each other, and with specific molecular involvements.
Affiliation(s)
- Elliot S Gershon
  - Department of Psychiatry, Department of Human Genetics, University of Chicago, United States.
- Godfrey Pearlson
  - Yale University Departments of Psychiatry & Neuroscience, Hartford, CT, United States; Olin Neuropsychiatry Research Center, Institute of Living, Hartford, Connecticut, USA
- Carol Tamminga
  - Department of Psychiatry, University of Texas Southwestern Medical Center, Dallas, TX, United States
- Brett Clementz
  - Department of Psychology, University of Georgia, Athens, GA, United States
- Peter F Buckley
  - School of Medicine Virginia Commonwealth University (VCU), Richmond, VA, United States
- Ney Alliey-Rodriguez
  - University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, IL, United States
- Chunyu Liu
  - University of Illinois at Chicago, Chicago, IL, United States
- John A Sweeney
  - Department of Psychiatry, University of Texas Southwestern Medical Center, Dallas, TX, United States; University of Cincinnati, Department of Psychiatry and Behavioral Neuroscience, Cincinnati, OH, United States
- Sarah Keedy
  - University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, IL, United States
- Shashwath A Meda
  - Yale University Departments of Psychiatry & Neuroscience, Hartford, CT, United States
- Neeraj Tandon
  - Beth Israel Deaconess Medical Center, Dept of Psychiatry, Harvard Medical School, United States
- Rebecca Shafee
  - Broad Institute of MIT and Harvard, Cambridge, MA, United States; Department of Genetics, Harvard Medical School, United States
- Jeffrey R Bishop
  - Department of Clinical and Experimental Pharmacology, University of Minnesota, Minneapolis, MN, United States
- Elena I Ivleva
  - Department of Psychiatry, University of Texas Southwestern Medical Center, Dallas, TX, United States
5
Freudenberg-Hua Y, Li W, Davies P. The Role of Genetics in Advancing Precision Medicine for Alzheimer's Disease-A Narrative Review. Front Med (Lausanne) 2018; 5:108. [PMID: 29740579 PMCID: PMC5928202 DOI: 10.3389/fmed.2018.00108] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Received: 01/21/2018] [Accepted: 04/03/2018] [Indexed: 12/12/2022] Open
Abstract
Alzheimer's disease (AD) is the most common type of dementia, which has a substantial genetic component. AD affects predominantly older people. Accordingly, the prevalence of dementia has been rising as the population ages. To date, there are no effective interventions that can cure or halt the progression of AD. The only available treatments are the management of certain symptoms and consequences of dementia. The current state-of-the-art medical care for AD comprises three simple principles: prevent the preventable, achieve early diagnosis, and manage the manageable symptoms. This review provides a summary of the current state of knowledge of risk factors for AD, biological diagnostic testing, and prospects for treatment. Special emphasis is given to recent advances in genetics of AD and the way genomic data may support prevention, early intervention, and development of effective pharmacological treatments. Mutations in the APP, PSEN1, and PSEN2 genes cause early onset Alzheimer's disease (EOAD) that follows a Mendelian inheritance pattern. For late onset Alzheimer's disease (LOAD), APOE4 was identified as a major risk allele more than two decades ago. Population-based genome-wide association studies of late onset AD have now additionally identified common variants at roughly 30 genetic loci. Furthermore, rare variants (allele frequency <1%) that influence the risk for LOAD have been identified in several genes. These genetic advances have broadened our insights into the biological underpinnings of AD. Moreover, the known genetic risk variants could be used to identify presymptomatic individuals at risk for AD and support diagnostic assessment of symptomatic subjects. Genetic knowledge may also facilitate precision medicine. 
The goal of precision medicine is to use biological knowledge and other health information to predict individual disease risk, understand disease etiology, identify disease subcategories, improve diagnosis, and provide personalized treatment strategies. We discuss the potential role of genetics in advancing precision medicine for AD along with its ethical challenges. We outline strategies to implement genomics into translational clinical research that will not only improve accuracy of dementia diagnosis, thus enabling more personalized treatment strategies, but may also speed up the discovery of novel drugs and interventions.
Affiliation(s)
- Yun Freudenberg-Hua
  - Litwin-Zucker Center for the study of Alzheimer’s Disease, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, United States
  - Division of Geriatric Psychiatry, Zucker Hillside Hospital, Northwell Health, Glen Oaks, NY, United States
- Wentian Li
  - Robert S Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, United States
- Peter Davies
  - Litwin-Zucker Center for the study of Alzheimer’s Disease, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, United States
6
Ding T, Xu J, Sun M, Zhu S, Gao J. Predicting microRNA biological functions based on genes discriminant analysis. Comput Biol Chem 2017; 71:230-235. [PMID: 29033260 DOI: 10.1016/j.compbiolchem.2017.09.008] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 09/07/2017] [Accepted: 09/25/2017] [Indexed: 01/09/2023]
Abstract
Although thousands of microRNAs (miRNAs) have been identified in recent experimental efforts, it remains a challenge to explore their specific biological functions through molecular biological experiments. Since members of the same family share the same or similar biological functions, classifying new miRNAs into their corresponding families will be helpful for their further functional analysis. In this study, we first built a vector space by characterizing features of miRNA sequences and structures according to their miRBase family organization. We then assigned miRNAs to their specific miRNA families by developing a novel genes discriminant analysis (GDA) approach. In each of the new families obtained from GDA, the nucleotide sequences of all members showed a high degree of similarity. We also employed 10-fold cross-validation to obtain accuracy rates of 68.68%, 80.74%, and 83.65%, respectively, for the original miRNA families with no fewer than two, three, and four members. These encouraging results suggest that the proposed GDA could not only support the identification of new miRNAs' families, but also contribute to predicting their biological functions.
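The 10-fold cross-validation mentioned in this abstract can be sketched generically as follows. This is only an illustration of how items are partitioned into training and held-out folds; it does not reproduce the GDA classifier, and the function name `k_fold_indices` is an assumption.

```python
def k_fold_indices(n_items: int, k: int = 10) -> list[tuple[list[int], list[int]]]:
    """Split item indices 0..n_items-1 into k (train, held_out) pairs."""
    # Assign every k-th index to the same fold so fold sizes differ by at most one.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_items) if i not in held]
        splits.append((train, held_out))
    return splits


if __name__ == "__main__":
    for train, held_out in k_fold_indices(25, k=10):
        print(len(train), held_out)
```

Each item is held out exactly once, so the reported accuracy is an average over predictions made on data the classifier did not see during training.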
Affiliation(s)
- Tao Ding
  - School of Science, Jiangnan University, Wuxi, China; School of Mathematics and Statistics, Newcastle University, Newcastle upon Tyne, UK.
- Junhua Xu
  - School of Science, Jiangnan University, Wuxi, China.
- Mengmeng Sun
  - School of Science, Jiangnan University, Wuxi, China.
- Shanshan Zhu
  - School of Science, Jiangnan University, Wuxi, China.
- Jie Gao
  - School of Science, Jiangnan University, Wuxi, China.
7
Li W, Fontanelli O, Miramontes P. Size distribution of function-based human gene sets and the split-merge model. Royal Society Open Science 2016; 3:160275. [PMID: 27853602 PMCID: PMC5108952 DOI: 10.1098/rsos.160275] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/22/2016] [Accepted: 07/01/2016] [Indexed: 06/06/2023]
Abstract
The sizes of paralogues (gene families produced by ancestral duplication) are known to follow a power-law distribution. We examine the size distribution of gene sets or gene families where genes are grouped by a similar function or share a common property. The size distribution of HUGO Gene Nomenclature Committee (HGNC) gene sets deviates from the power-law, and can be fitted much better by a beta rank function. We propose a simple mechanism to break a power-law size distribution by a combination of splitting and merging operations. The largest gene sets are split into two to account for the subfunctional categories, and a small proportion of other gene sets are merged into larger sets as new common themes might be realized. These operations are not uncommon for a curator of gene sets. A simulation shows that iteration of these operations changes the size distribution of Ensembl paralogues and could lead to a distribution fitted by a beta rank function. We further illustrate the application of the beta rank function with the distribution of transcription factors and drug target genes among HGNC gene families.
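The splitting and merging operations described in this abstract can be illustrated with a toy simulation. This is a loose sketch, not the authors' model: the random cut point, the single merge per step, and the `merge_prob` parameter are all assumptions introduced here.

```python
import random


def split_merge_step(sizes: list[int], merge_prob: float, rng: random.Random) -> list[int]:
    """Split the largest set in two, then occasionally merge two random sets."""
    ordered = sorted(sizes, reverse=True)
    largest, rest = ordered[0], ordered[1:]
    if largest >= 2:
        cut = rng.randint(1, largest - 1)  # both parts keep at least one gene
        rest.extend([cut, largest - cut])
    else:
        rest.append(largest)  # a set of one gene cannot be split
    if len(rest) >= 2 and rng.random() < merge_prob:
        i, j = rng.sample(range(len(rest)), 2)
        merged = rest[i] + rest[j]
        rest = [s for k, s in enumerate(rest) if k not in (i, j)]
        rest.append(merged)
    return rest


if __name__ == "__main__":
    rng = random.Random(0)
    sizes = [100, 50, 25, 10, 5, 1]  # hypothetical starting set sizes
    for _ in range(50):
        sizes = split_merge_step(sizes, merge_prob=0.2, rng=rng)
    print(sorted(sizes, reverse=True))
```

Both operations conserve the total number of genes, so only the shape of the rank-size curve changes from one iteration to the next.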
Affiliation(s)
- Wentian Li
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA
- Oscar Fontanelli
  - Departamento de Matemáticas, Facultad de Ciencias, Universidad Nacional Autónoma de México, Circuito Exterior, Ciudad Universitaria, México 04510 DF, México
- Pedro Miramontes
  - Departamento de Matemáticas, Facultad de Ciencias, Universidad Nacional Autónoma de México, Circuito Exterior, Ciudad Universitaria, México 04510 DF, México
  - Bioinformatics Group and Interdisciplinary Center for Bioinformatics, University of Leipzig, Haertelstrasse 16–18, 04107 Leipzig, Germany
8
Developing of the Computer Method for Annotation of Bacterial Genes. Adv Bioinformatics 2016; 2015:635437. [PMID: 26770195 PMCID: PMC4684837 DOI: 10.1155/2015/635437] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/15/2015] [Revised: 11/16/2015] [Accepted: 11/18/2015] [Indexed: 02/07/2023] Open
Abstract
In recent years, a great number of bacterial genomes have been sequenced. One of the most important challenges of computational genomics is now the functional annotation of nucleic acid sequences. In this study, we present a computational method and an annotation system for predicting biological functions using phylogenetic profiles. The phylogenetic profile of a gene was created by searching for similarities between the nucleotide sequence of the gene and 1204 reference genomes, and then estimating the statistical significance of the similarities found. The profiles of genes with known functions were used to predict possible functions and functional groups for new genes. We performed functional annotation of genes from 104 bacterial genomes and compared the functions predicted by our system with the already known functions. For genes that had already been annotated, the known function matched the predicted function 63% of the time, and 86% of the time the known function was found among the top five predicted functions. In addition, our system increased the share of annotated genes by 19%. The developed system may be used as an alternative or complement to current annotation systems.
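The idea of ranking candidate functions by profile similarity, as evaluated by the "top five" figure in this abstract, can be sketched as below. The abstract does not state how profiles are compared, so the binary presence/absence encoding, the Jaccard similarity, and every name in the code are assumptions made for illustration.

```python
def jaccard(a: frozenset[int], b: frozenset[int]) -> float:
    """Jaccard similarity between two sets of reference-genome indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_functions(
    query: frozenset[int],
    annotated: list[tuple[frozenset[int], str]],
    top_n: int = 5,
) -> list[str]:
    """Rank known functions by their best-matching profile for the query gene."""
    best: dict[str, float] = {}
    for profile, function in annotated:
        score = jaccard(query, profile)
        if score > best.get(function, -1.0):
            best[function] = score
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [function for function, _ in ranked[:top_n]]


if __name__ == "__main__":
    # Hypothetical profiles: each set lists genomes with a significant hit.
    known = [
        (frozenset({0, 1, 2}), "transporter"),
        (frozenset({5, 6}), "kinase"),
        (frozenset({0, 1}), "transporter"),
    ]
    print(rank_functions(frozenset({0, 1, 2, 3}), known))
```

A prediction would count as a top-five match when the known function of an already annotated gene appears anywhere in the returned list.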
9
Hernández-Lemus E, Li W, Meyer P. Advances in systems biology--New trends and perspectives. Comput Biol Chem 2015; 59 Pt B:1-2. [PMID: 26364255 DOI: 10.1016/j.compbiolchem.2015.09.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Affiliation(s)
- Enrique Hernández-Lemus
  - Computational Genomics Department, National Institute of Genomic Medicine (INMEGEN), and Center for Complexity Sciences, National Autonomous University of México (UNAM), Mexico.
- Wentian Li
  - Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, USA.
- Pablo Meyer
  - Translational Systems Biology and Nanobiotechnology Group, IBM T.J. Watson Research Center, USA.