1. Wu B, Shen L, Zhang T, Ghanem B. MAP Inference Via $\ell_2$-Sphere Linear Program Reformulation. Int J Comput Vis 2020. [DOI: 10.1007/s11263-020-01313-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
2. Swoboda P, Shekhovtsov A, Kappes JH, Schnorr C, Savchynskyy B. Partial Optimality by Pruning for MAP-Inference with General Graphical Models. IEEE Trans Pattern Anal Mach Intell 2016; 38:1370-1382. [PMID: 26468978 DOI: 10.1109/tpami.2015.2484327] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
We consider the energy minimization problem for undirected graphical models, also known as the MAP-inference problem for Markov random fields, which is NP-hard in general. We propose a novel polynomial time algorithm to obtain a part of its optimal non-relaxed integral solution. Our algorithm is initialized with the variables that take integral values in the solution of a convex relaxation of the MAP-inference problem and iteratively prunes those that do not satisfy our criterion for partial optimality. We show that our pruning strategy is in a certain sense theoretically optimal. Empirically, our method also outperforms previous approaches in terms of the number of persistently labelled variables. The method is very general, as it is applicable to models with arbitrary factors of an arbitrary order and can employ any solver for the considered relaxed problem. Our method's runtime is determined by the runtime of the convex relaxation solver for the MAP-inference problem.
3. A Comparative Study of Modern Inference Techniques for Structured Discrete Energy Minimization Problems. Int J Comput Vis 2015. [DOI: 10.1007/s11263-015-0809-x] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Indexed: 10/23/2022]
4. Efroni S, Meerzaman D, Schaefer CF, Greenblum S, Soo-Lyu M, Hu Y, Cultraro C, Meshorer E, Buetow KH. Systems analysis utilising pathway interactions identifies sonic hedgehog pathway as a primary biomarker and oncogenic target in hepatocellular carcinoma. IET Syst Biol 2014; 7:243-51. [PMID: 24712101 DOI: 10.1049/iet-syb.2010.0078] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/25/2023] Open
Abstract
The development and progression of cancer is associated with disruption of biological networks. Historically, studies have identified sets of signature genes involved in events ultimately leading to the development of cancer. Identification of such sets does not indicate which biologic processes are oncogenic drivers and makes it difficult to identify key networks to target for interventions. Using a comprehensive, integrated computational approach, the authors identify the sonic hedgehog (SHH) pathway as the gene network that most significantly distinguishes tumour and tumour-adjacent samples in human hepatocellular carcinoma (HCC). The analysis reveals that the SHH pathway is commonly activated in the tumour samples and its activity most significantly differentiates tumour from the non-tumour samples. The authors experimentally validate these in silico findings in the same biologic material using Western blot analysis. This analysis reveals that the expression levels of SHH, phosphorylated cyclin B1, and CDK7 are much higher in most tumour tissues as compared to normal tissue. It is also shown that siRNA-mediated silencing of SHH gene expression resulted in a significant reduction of cell proliferation in a liver cancer cell line, SNU449, indicating that SHH plays a major role in promoting cell proliferation in liver cancer. The SHH pathway is a key network underpinning HCC aetiology, which may guide the development of interventions for this most common form of human liver cancer.
5. An overview of the statistical methods used for inferring gene regulatory networks and protein-protein interaction networks. Adv Bioinformatics 2013; 2013:953814. [PMID: 23509452 PMCID: PMC3594945 DOI: 10.1155/2013/953814] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 07/20/2012] [Revised: 01/12/2013] [Accepted: 01/17/2013] [Indexed: 11/17/2022] Open
Abstract
The large influx of data from high-throughput genomic and proteomic technologies has encouraged the researchers to seek approaches for understanding the structure of gene regulatory networks and proteomic networks. This work reviews some of the most important statistical methods used for modeling of gene regulatory networks (GRNs) and protein-protein interaction (PPI) networks. The paper focuses on the recent advances in the statistical graphical modeling techniques, state-space representation models, and information theoretic methods that were proposed for inferring the topology of GRNs. It appears that the problem of inferring the structure of PPI networks is quite different from that of GRNs. Clustering and probabilistic graphical modeling techniques are of prime importance in the statistical inference of PPI networks, and some of the recent approaches using these techniques are also reviewed in this paper. Performance evaluation criteria for the approaches used for modeling GRNs and PPI networks are also discussed.
6. Lichtenstein I, Charleston MA, Caetano TS, Gamble JR, Vadas MA. Active subnetwork recovery with a mechanism-dependent scoring function; with application to angiogenesis and organogenesis studies. BMC Bioinformatics 2013; 14:59. [PMID: 23432934 PMCID: PMC3663784 DOI: 10.1186/1471-2105-14-59] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/28/2012] [Accepted: 01/21/2013] [Indexed: 11/10/2022] Open
Abstract
Background The active subnetwork learning problem involves finding subnetworks of a bio-molecular network that are active in a particular condition. Many approaches integrate observation data (e.g., gene expression) with the network topology to find candidate subnetworks. Increasingly, pathway databases contain additional annotation information that can be mined to improve prediction accuracy, such as interaction mechanism annotations (e.g., transcription, microRNA, cleavage). We introduce a mechanism-based approach to active subnetwork recovery which exploits such annotations. We suggest that neighboring interactions in a network tend to be co-activated in a way that depends on the “correlation” of their mechanism annotations. For example, neighboring phosphorylation and de-phosphorylation interactions may be more likely to be co-activated than neighboring phosphorylation and covalent bonding interactions.
Results Our method iteratively learns the mechanism correlations and finds the most likely active subnetwork. We use a probabilistic graphical model with a Markov Random Field component that creates dependencies between the states (active or non-active) of neighboring interactions and incorporates a mechanism-based component into the function. We apply a heuristic EM-based algorithm suited to the problem. Using simulated data in networks downloaded from GeneGO, we validated our method’s performance against the same approach without the mechanism-based component and against two other existing methods, in terms of correctly recovering (1) the true interaction states and (2) global network properties of the original network. We applied our method to networks generated from time-course gene expression studies in angiogenesis and lung organogenesis and validated the findings from a biological perspective against current literature.
Conclusions The advantage of our mechanism-based approach is best seen in networks composed of connected regions with a large number of interactions annotated with a subset of mechanisms, e.g., a regulatory region of transcription interactions, or a cleavage cascade region. When applied to real datasets, our method recovered novel and biologically meaningful putative interactions, e.g., interactions from an integrin signaling pathway using the angiogenesis dataset, and a group of regulatory microRNA interactions in an organogenesis network.
Affiliation(s)
- Ilana Lichtenstein
- School of Information Technologies, University of Sydney, Sydney, NSW 2006, Australia.
7. Network motifs in the transcriptional regulation network of cervical carcinoma cells respond to EGF. Arch Gynecol Obstet 2012. [PMID: 23188118 DOI: 10.1007/s00404-012-2608-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 12/18/2022]
Abstract
PURPOSE Cervical carcinoma is the second most prevalent and the fifth most deadly malignancy seen in women worldwide. Dysregulated activation of the EGF ErbB system has been implicated in diverse types of human cancer; however, how it is regulated in human cervical cancer cells remains elusive. We herein aimed to explore the mechanisms of cervical carcinoma response to epidermal growth factor (EGF), with a view of the pathways activated by EGF. METHODS Using the GSE6783 Affymetrix microarray data accessible from the Gene Expression Omnibus database, we first identified the differentially expressed genes between EGF-stimulated and -unstimulated samples. Then we constructed a regulation network and identified the network motifs. We also performed biological process and pathway enrichment analyses to functionally classify the genes in the regulation network. RESULTS A total of 11 network motifs were identified in the regulation network. EGF treatment could increase the risk of cancer via dysregulation of cancer-related pathways and immune response pathways. CONCLUSIONS Network motif analysis is useful for mining the information underlying the network. We hope our work could serve as a basis for further experimentation.
8. James K, Wipat A, Hallinan J. Is newer better?--evaluating the effects of data curation on integrated analyses in Saccharomyces cerevisiae. Integr Biol (Camb) 2012; 4:715-27. [PMID: 22526920 DOI: 10.1039/c2ib00123c] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/21/2022]
Abstract
Recent high-throughput experiments have produced a wealth of heterogeneous datasets, each of which provides information about different aspects of the cell. Consequently, integration of diverse data types is essential in order to address many biological questions. The quality of any integrated analysis system is dependent upon the quality of its component data, and upon the Gold Standard data used to evaluate it. It is commonly assumed that the quality of data improves as databases grow and change, particularly for manually curated databases. However, the validity of this assumption can be questioned, given the constant changes in the data coupled with the high level of noise associated with high-throughput experimental techniques. One of the most powerful approaches to data integration is the use of Probabilistic Functional Integrated Networks (PFINs). Here, we systematically analyse the changes in four highly-curated and widely-used online databases and evaluate the extent to which these changes affect the protein function prediction performance of PFINs in the yeast Saccharomyces cerevisiae. We find that the global trend in network performance improves over time. Where individual areas of biology are concerned, however, the most recent files do not always produce the best results. Individual datasets have unique biases towards different biological processes and by selecting and integrating relevant datasets performance can be improved. When using any type of integrated system to answer a specific biological question careful selection of raw data and Gold Standard is vital, since the most recent data may not be the most appropriate.
Affiliation(s)
- Katherine James
- School of Computing Science, Newcastle University, Newcastle upon Tyne, NE1 7RU, United Kingdom
9. Khan A, Majid A, Hayat M. CE-PLoc: an ensemble classifier for predicting protein subcellular locations by fusing different modes of pseudo amino acid composition. Comput Biol Chem 2011; 35:218-29. [PMID: 21864791 DOI: 10.1016/j.compbiolchem.2011.05.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 02/23/2011] [Revised: 05/17/2011] [Accepted: 05/18/2011] [Indexed: 12/18/2022]
Abstract
Precise information about protein locations in a cell facilitates the understanding of the function of a protein and its interaction in the cellular environment. This information further helps in the study of the specific metabolic pathways and other biological processes. We propose an ensemble approach called "CE-PLoc" for predicting subcellular locations based on fusion of individual classifiers. The proposed approach utilizes features obtained from both dipeptide composition (DC) and amphiphilic pseudo amino acid composition (PseAAC) based feature extraction strategies. Different feature spaces are obtained by varying the dimensionality using PseAAC for a selected base learner. The performance of the individual learning mechanisms, such as support vector machine, nearest neighbor, probabilistic neural network, and covariant discriminant, which are trained using PseAAC based features, is first analyzed. Classifiers are developed using the same learning mechanism but trained on PseAAC based feature spaces of varying dimensions. These classifiers are combined through a voting strategy and an improvement in prediction performance is achieved. Prediction performance is further enhanced by developing CE-PLoc through the combination of different learning mechanisms trained on both the DC based feature space and PseAAC based feature spaces of varying dimensions. The predictive performance of the proposed CE-PLoc is evaluated for two benchmark datasets of protein subcellular locations using accuracy, MCC, and Q-statistics. Using the jackknife test, prediction accuracies of 81.47 and 83.99% are obtained for the 12 and 14 subcellular locations datasets, respectively. In the case of the independent dataset test, prediction accuracies are 87.04 and 87.33% for the 12 and 14 class datasets, respectively.
Affiliation(s)
- Asifullah Khan
- Department of Information and Computer Sciences, Pakistan Institute of Engineering and Applied Sciences, Nilore, Islamabad, Pakistan.
10.
Abstract
In this letter, we develop and simulate a large-scale network of spiking neurons that approximates the inference computations performed by graphical models. Unlike previous related schemes, which used sum and product operations in either the log or linear domains, the current model uses an inference scheme based on the sum and maximization operations in the log domain. Simulations show that using these operations, a large-scale circuit, which combines populations of spiking neurons as basic building blocks, is capable of finding close approximations to the full mathematical computations performed by graphical models within a few hundred milliseconds. The circuit is general in the sense that it can be wired for any graph structure, it supports multistate variables, and it uses standard leaky integrate-and-fire neuronal units. Following previous work, which proposed relations between graphical models and the large-scale cortical anatomy, we focus on the cortical microcircuitry and propose how anatomical and physiological aspects of the local circuitry may map onto elements of the graphical model implementation. We discuss in particular the roles of three major types of inhibitory neurons (small fast-spiking basket cells, large layer 2/3 basket cells, and double-bouquet neurons), subpopulations of strongly interconnected neurons with their unique connectivity patterns in different cortical layers, and the possible role of minicolumns in the realization of the population-based maximum operation.
Affiliation(s)
- Shai Litvak
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel.
11. Roy S, Martinez D, Platero H, Lane T, Werner-Washburne M. Exploiting amino acid composition for predicting protein-protein interactions. PLoS One 2009; 4:e7813. [PMID: 19936254 PMCID: PMC2775920 DOI: 10.1371/journal.pone.0007813] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.5] [Received: 06/24/2009] [Accepted: 10/15/2009] [Indexed: 11/23/2022] Open
Abstract
Background Computational prediction of protein interactions typically uses protein domains as classifier features because they capture conserved information of interaction surfaces. However, approaches relying on domains as features cannot be applied to proteins without any domain information. In this paper, we explore the contribution of pure amino acid composition (AAC) for protein interaction prediction. This simple feature, which is based on normalized counts of single amino acids or pairs of amino acids, is applicable to proteins from any sequenced organism and can be used to compensate for the lack of domain information. Results AAC performed on par with protein interaction prediction based on domains on three yeast protein interaction datasets. Similar behavior was obtained using different classifiers, indicating that our results are a function of features and not of classifiers. In addition to yeast datasets, AAC performed comparably on worm and fly datasets. Prediction of interactions for the entire yeast proteome identified a large number of novel interactions, the majority of which co-localized or participated in the same processes. Our high confidence interaction network included both well-studied and uncharacterized proteins. Proteins with known function were involved in actin assembly and cell budding. Uncharacterized proteins interacted with proteins involved in reproduction and cell budding, thus providing putative biological roles for the uncharacterized proteins. Conclusion AAC is a simple, yet powerful feature for predicting protein interactions, and can be used alone or in conjunction with protein domains to predict new and validate existing interactions. More importantly, AAC alone performs on par with existing, but more complex, features, indicating the presence of sequence-level information that is predictive of interaction, but which is not necessarily restricted to domains.
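The AAC feature described in this abstract (normalized counts of single amino acids and of amino acid pairs) can be illustrated with a minimal sketch. The abstract does not say how pairs are defined or how counts are normalized, so the choices below are assumptions for illustration only: pairs are taken as adjacent residues, counts are divided by the number of residues or pairs, and the function name `aac_features` is hypothetical.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_features(sequence: str) -> list[float]:
    """Return 20 single-residue frequencies followed by 400 adjacent-pair frequencies.

    Assumption: non-standard symbols are dropped before counting, so residues on
    either side of a dropped symbol are treated as adjacent (a simplification).
    """
    seq = [c for c in sequence.upper() if c in AMINO_ACIDS]
    singles = Counter(seq)
    pairs = Counter(a + b for a, b in zip(seq, seq[1:]))
    n_single = max(len(seq), 1)    # avoid division by zero on empty input
    n_pair = max(len(seq) - 1, 1)
    single_vec = [singles[a] / n_single for a in AMINO_ACIDS]
    pair_vec = [pairs[a + b] / n_pair for a, b in product(AMINO_ACIDS, repeat=2)]
    return single_vec + pair_vec
```

Because the vector depends only on the sequence, it can be computed for any protein, including those without domain annotations. How two proteins' vectors are combined into one classifier input is not specified in the abstract and is left open here.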
Affiliation(s)
- Sushmita Roy
- Computer Science, University of New Mexico, Albuquerque, New Mexico, United States of America.
12. Kashani ZRM, Ahrabian H, Elahi E, Nowzari-Dalini A, Ansari ES, Asadi S, Mohammadi S, Schreiber F, Masoudi-Nejad A. Kavosh: a new algorithm for finding network motifs. BMC Bioinformatics 2009; 10:318. [PMID: 19799800 PMCID: PMC2765973 DOI: 10.1186/1471-2105-10-318] [Citation(s) in RCA: 158] [Impact Index Per Article: 10.5] [Received: 03/17/2009] [Accepted: 10/04/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Complex networks are studied across many fields of science and are particularly important for understanding biological processes. Motifs in networks are small connected sub-graphs that occur at significantly higher frequencies than in random networks. They have recently gathered much attention as a useful concept to uncover structural design principles of complex networks. Existing algorithms for finding network motifs are extremely costly in CPU time and memory consumption and have practical restrictions on the size of motifs. RESULTS We present a new algorithm (Kavosh) for finding k-size network motifs with less memory and CPU time in comparison to other existing algorithms. Our algorithm is based on counting all k-size sub-graphs of a given graph (directed or undirected). We evaluated our algorithm on biological networks of E. coli and S. cerevisiae, and also on non-biological networks: a social and an electronic network. CONCLUSION The efficiency of our algorithm is demonstrated by comparing the obtained results with three well-known motif finding tools. For comparison, the CPU time, memory usage and the similarities of obtained motifs are considered. In addition, Kavosh can be employed for finding motifs of size greater than eight, while most of the other algorithms are restricted to motifs of size eight or less. The Kavosh source code and help files are freely available at: http://Lbb.ut.ac.ir/Download/LBBsoft/Kavosh/.
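To make "counting all k-size sub-graphs" concrete, here is a brute-force baseline for an undirected graph. It is not the Kavosh enumeration, which the abstract does not describe. It tests every k-node subset for connectivity, which is only feasible for small graphs and small k. The function names are hypothetical, and grouping the subgraphs into isomorphism classes and comparing against random networks are both omitted.

```python
from itertools import combinations


def is_connected(nodes, adj):
    """Check whether the subgraph induced by `nodes` is connected (depth-first search)."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def count_connected_subgraphs(edges, k):
    """Count connected induced subgraphs with exactly k nodes in an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Exhaustive check of every k-node subset: exponential in k, for illustration only.
    return sum(1 for subset in combinations(sorted(adj), k) if is_connected(subset, adj))
```

For a triangle with one pendant node, `count_connected_subgraphs([(1, 2), (2, 3), (1, 3), (3, 4)], 3)` counts the node sets {1, 2, 3}, {1, 3, 4} and {2, 3, 4}. The cost of this exhaustive approach grows quickly with k, which is the kind of restriction on motif size that the abstract refers to.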
Affiliation(s)
- Zahra Razaghi Moghadam Kashani
- Laboratory of Systems Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
13. Kashima H, Yamanishi Y, Kato T, Sugiyama M, Tsuda K. Simultaneous inference of biological networks of multiple species from genome-wide data and evolutionary information: a semi-supervised approach. Bioinformatics 2009; 25:2962-8. [PMID: 19689962 DOI: 10.1093/bioinformatics/btp494] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION The existing supervised methods for biological network inference work on each of the networks individually based only on intra-species information such as gene expression data. We believe that it will be more effective to use genomic data and cross-species evolutionary information from different species simultaneously, rather than to use the genomic data alone. RESULTS We created a new semi-supervised learning method called Link Propagation for inferring biological networks of multiple species based on genome-wide data and evolutionary information. The new method was applied to simultaneous reconstruction of three metabolic networks of Caenorhabditis elegans, Helicobacter pylori and Saccharomyces cerevisiae, based on gene expression similarities and amino acid sequence similarities. The experimental results proved that the new simultaneous network inference method consistently improves the predictive performance over the individual network inferences, and it also outperforms in accuracy and speed other established methods such as the pairwise support vector machine. AVAILABILITY The software and data are available at http://cbio.ensmp.fr/~yyamanishi/LinkPropagation/.
Affiliation(s)
- Hisashi Kashima
- IBM Research, Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato, Kanagawa 242-8502, Japan.
14. Huttenhower C, Myers CL, Hibbs MA, Troyanskaya OG. Computational analysis of the yeast proteome: understanding and exploiting functional specificity in genomic data. Methods Mol Biol 2009; 548:273-93. [PMID: 19521830 DOI: 10.1007/978-1-59745-540-4_15] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 02/25/2023]
Abstract
Modern experimental techniques have produced a wealth of high-throughput data that has enabled the ongoing genomic revolution. As the field continues to integrate experimental and computational analyses of this data, it is essential that performance evaluations of high-throughput results be carried out in a consistent and biologically informative manner. Here, we present an overview of evaluation techniques for high-throughput experimental data and computational methods, and we discuss a number of potential pitfalls in this process. These primarily involve the biological diversity of genomic data, which can be masked or misrepresented in overly simplified global evaluations. We describe systems for preserving information about biological context during dataset evaluation, which can help to ensure that multiple different evaluations are more directly comparable. This biological variety in high-throughput data can also be taken advantage of computationally through data integration and process specificity to produce richer systems-level predictions of cellular function. An awareness of these considerations can greatly improve the evaluation and analysis of any high-throughput experimental dataset.
Affiliation(s)
- Curtis Huttenhower
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA
15. Leach SM, Tipney H, Feng W, Baumgartner WA, Kasliwal P, Schuyler RP, Williams T, Spritz RA, Hunter L. Biomedical discovery acceleration, with applications to craniofacial development. PLoS Comput Biol 2009; 5:e1000215. [PMID: 19325874 PMCID: PMC2653649 DOI: 10.1371/journal.pcbi.1000215] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.7] [Received: 09/25/2008] [Accepted: 02/12/2009] [Indexed: 01/17/2023] Open
Abstract
The profusion of high-throughput instruments and the explosion of new results in the scientific literature, particularly in molecular biomedicine, is both a blessing and a curse to the bench researcher. Even knowledgeable and experienced scientists can benefit from computational tools that help navigate this vast and rapidly evolving terrain. In this paper, we describe a novel computational approach to this challenge, a knowledge-based system that combines reading, reasoning, and reporting methods to facilitate analysis of experimental data. Reading methods extract information from external resources, either by parsing structured data or using biomedical language processing to extract information from unstructured data, and track knowledge provenance. Reasoning methods enrich the knowledge that results from reading by, for example, noting two genes that are annotated to the same ontology term or database entry. Reasoning is also used to combine all sources into a knowledge network that represents the integration of all sorts of relationships between a pair of genes, and to calculate a combined reliability score. Reporting methods combine the knowledge network with a congruent network constructed from experimental data and visualize the combined network in a tool that facilitates the knowledge-based analysis of that data. An implementation of this approach, called the Hanalyzer, is demonstrated on a large-scale gene expression array dataset relevant to craniofacial development. The use of the tool was critical in the creation of hypotheses regarding the roles of four genes never previously characterized as involved in craniofacial development; each of these hypotheses was validated by further experimental work.
Affiliation(s)
- Sonia M. Leach
- Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
- Hannah Tipney
- Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
- Weiguo Feng
- Department of Craniofacial Biology, University of Colorado at Denver, Denver, Colorado, United States of America
- William A. Baumgartner
- Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
- Priyanka Kasliwal
- Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
- Ronald P. Schuyler
- Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
- Trevor Williams
- Department of Craniofacial Biology, University of Colorado at Denver, Denver, Colorado, United States of America
- Richard A. Spritz
- Human Medical Genetics Program, University of Colorado at Denver, Denver, Colorado, United States of America
- Lawrence Hunter
- Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
16. Hibbs MA, Myers CL, Huttenhower C, Hess DC, Li K, Caudy AA, Troyanskaya OG. Directing experimental biology: a case study in mitochondrial biogenesis. PLoS Comput Biol 2009; 5:e1000322. [PMID: 19300515 PMCID: PMC2654405 DOI: 10.1371/journal.pcbi.1000322] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.2] [Received: 03/03/2008] [Accepted: 02/06/2009] [Indexed: 11/25/2022] Open
Abstract
Computational approaches have promised to organize collections of functional genomics data into testable predictions of gene and protein involvement in biological processes and pathways. However, few such predictions have been experimentally validated on a large scale, leaving many bioinformatic methods unproven and underutilized in the biology community. Further, it remains unclear what biological concerns should be taken into account when using computational methods to drive real-world experimental efforts. To investigate these concerns and to establish the utility of computational predictions of gene function, we experimentally tested hundreds of predictions generated from an ensemble of three complementary methods for the process of mitochondrial organization and biogenesis in Saccharomyces cerevisiae. The biological data with respect to the mitochondria are presented in a companion manuscript published in PLoS Genetics (doi:10.1371/journal.pgen.1000407). Here we analyze and explore the results of this study that are broadly applicable for computationalists applying gene function prediction techniques, including a new experimental comparison with 48 genes representing the genomic background. Our study leads to several conclusions that are important to consider when driving laboratory investigations using computational prediction approaches. While most genes in yeast are already known to participate in at least one biological process, we confirm that genes with known functions can still be strong candidates for annotation of additional gene functions. We find that different analysis techniques and different underlying data can both greatly affect the types of functional predictions produced by computational methods. This diversity allows an ensemble of techniques to substantially broaden the biological scope and breadth of predictions. 
We also find that performing prediction and validation steps iteratively allows us to more completely characterize a biological area of interest. While this study focused on a specific functional area in yeast, many of these observations may be useful in the contexts of other processes and organisms.
Genome sequencing has provided us with “parts lists” of genes for many organisms, but many of the biological roles of these genes are still unknown. While a great deal of functional genomic data exists, providing information about these genes and their roles, the rate at which these data are leveraged into concrete biological knowledge lags far behind the rate of data generation. Many computational approaches have been developed to generate accurate predictions of gene functions, with the goal of bridging this divide. However, as no large-scale experimental efforts have been based on such approaches, their validity and utility remain unproven. We have performed a study that experimentally evaluates predictions from a combination of three computational function prediction approaches, focusing on mitochondrion-related processes in brewer's yeast as a model system. By using computational predictions to guide our laboratory investigation, we have greatly accelerated the rate at which proteins can be assigned to biological processes. Further, our results demonstrate that in order to achieve the best results, it is important for computational biologists to consider both the underlying data and the algorithmic foundations of the methods used to predict function. Lastly, we demonstrate that iterating through phases of prediction and validation has quickly and extensively expanded our knowledge of mitochondrial biology.
Affiliation(s)
- Matthew A. Hibbs
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, New Jersey, United States of America
  - Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- Chad L. Myers
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, New Jersey, United States of America
  - Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, United States of America
- Curtis Huttenhower
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, New Jersey, United States of America
  - Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- David C. Hess
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, New Jersey, United States of America
- Kai Li
  - Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- Amy A. Caudy
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, New Jersey, United States of America
- Olga G. Troyanskaya
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, New Jersey, United States of America
  - Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
|
17
|
Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics 2007; 8:239. [PMID: 17615067 PMCID: PMC1939716 DOI: 10.1186/1471-2105-8-239] [Citation(s) in RCA: 83] [Impact Index Per Article: 4.9] [Received: 05/10/2007] [Accepted: 07/05/2007] [Indexed: 11/24/2022] Open
Abstract
Background: Although the prediction of protein-protein interactions has been extensively investigated for yeast, few such datasets exist for the far larger proteome in human. Furthermore, it has recently been estimated that the overall average false positive rate of available computational and high-throughput experimental interaction datasets is as high as 90%.
Results: The prediction of human protein-protein interactions was investigated by combining orthogonal protein features within a probabilistic framework. The features include co-expression, orthology to known interacting proteins and the full-Bayesian combination of subcellular localization, co-occurrence of domains and post-translational modifications. A novel scoring function for local network topology was also investigated. This topology feature greatly enhanced the predictions and, together with the full-Bayes combined features, made the largest contribution to the predictions. Using a conservative threshold, our most accurate predictor identifies 37606 human interactions, 32892 (80%) of which are not present in other publicly available large human interaction datasets, thus substantially increasing the coverage of the human interaction map. A subset of the 32892 novel predicted interactions have been independently validated. Comparison of the prediction dataset to other available human interaction datasets estimates the false positive rate of the new method to be below 80%, which is competitive with other methods. Since the new method scores and ranks all human protein pairs, smaller subsets of higher quality can be generated, thus leading to even lower false positive prediction rates.
Conclusion: The set of interactions predicted in this work increases the coverage of the human interaction map and will help determine the highest confidence human interactions.
|