51
Esposito F, Gillis N, Del Buono N. Orthogonal joint sparse NMF for microarray data analysis. J Math Biol 2019; 79:223-247. [PMID: 31004215 DOI: 10.1007/s00285-019-01355-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 03/22/2018] [Revised: 03/29/2019] [Indexed: 12/20/2022]
Abstract
3D microarrays, generally known as gene-sample-time microarrays, couple the information on different time points collected by 2D microarrays that measure gene expression levels among different samples. Their analysis is useful in several biomedical applications, like monitoring dose or drug treatment responses of patients over time in pharmacogenomics studies. Many statistical and data analysis tools have been used to extract useful information. In particular, nonnegative matrix factorization (NMF), with its natural nonnegativity constraints, has demonstrated its ability to extract from 2D microarrays relevant information on specific genes involved in the particular biological process. In this paper, we propose a new NMF model, namely Orthogonal Joint Sparse NMF, to extract relevant information from 3D microarrays containing the time evolution of a 2D microarray, by adding constraints to enforce important biological properties useful for further biological analysis. We develop multiplicative update rules that decrease the objective function monotonically, and compare our approach to state-of-the-art NMF algorithms on both synthetic and real data sets.
Affiliation(s)
- Flavia Esposito
  - Department of Mathematics, University of Bari Aldo Moro, via E. Orabona 4, 70125, Bari, Italy
  - INDAM Research Group GNCS, Roma, Italy
- Nicolas Gillis
  - Department of Mathematics and Operational Research, Université de Mons, Rue de Houdain 9, 7000, Mons, Belgium
- Nicoletta Del Buono
  - Department of Mathematics, University of Bari Aldo Moro, via E. Orabona 4, 70125, Bari, Italy
  - INDAM Research Group GNCS, Roma, Italy
52
Wang K, Porter MD. Optimal Bayesian clustering using non-negative matrix factorization. Comput Stat Data Anal 2018. [DOI: 10.1016/j.csda.2018.08.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/16/2022]
53
Sanfilippo P, Wen J, Lai EC. Landscape and evolution of tissue-specific alternative polyadenylation across Drosophila species. Genome Biol 2017; 18:229. [PMID: 29191225 PMCID: PMC5707805 DOI: 10.1186/s13059-017-1358-0] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.6] [Received: 06/02/2017] [Accepted: 11/08/2017] [Indexed: 12/19/2022]
Abstract
BACKGROUND Drosophila melanogaster has one of the best-described transcriptomes of any multicellular organism. Nevertheless, the paucity of 3'-sequencing data in this species precludes comprehensive assessment of alternative polyadenylation (APA), which is subject to broad tissue-specific control. RESULTS Here, we generate deep 3'-sequencing data from 23 developmental stages, tissues, and cell lines of D. melanogaster, yielding a comprehensive atlas of ~62,000 polyadenylated ends. These data broadly extend the annotated transcriptome, identify ~40,000 novel 3' termini, and reveal that two-thirds of Drosophila genes are subject to APA. Furthermore, we dramatically expand the numbers of genes known to be subject to tissue-specific APA, such as 3' untranslated region (UTR) lengthening in head and 3' UTR shortening in testis, and characterize new tissue and developmental 3' UTR patterns. Our thorough 3' UTR annotations permit reassessment of post-transcriptional regulatory networks, via conserved miRNA and RNA binding protein sites. To evaluate the evolutionary conservation and divergence of APA patterns, we generate developmental and tissue-specific 3'-seq libraries from Drosophila yakuba and Drosophila virilis. We document broadly analogous tissue-specific APA trends in these species, but also observe significant alterations in 3' end usage across orthologs. We exploit the population of functionally evolving poly(A) sites to gain clear evidence that evolutionary divergence in core polyadenylation signal (PAS) and downstream sequence element (DSE) motifs drives broad alterations in 3' UTR isoform expression across the Drosophila phylogeny. CONCLUSIONS These data provide a critical resource for the Drosophila community and offer many insights into the complex control of alternative tissue-specific 3' UTR formation and its consequences for post-transcriptional regulatory networks.
Affiliation(s)
- Piero Sanfilippo
  - Department of Developmental Biology, Sloan-Kettering Institute, New York, New York, 10065, USA
  - Louis V. Gerstner, Jr. Graduate School of Biomedical Sciences, Memorial Sloan Kettering Cancer Center, New York, New York, 10065, USA
- Jiayu Wen
  - Department of Developmental Biology, Sloan-Kettering Institute, New York, New York, 10065, USA
  - Present address: Biochemistry and Biomedical Sciences, Research School of Biology, ANU College of Science, The Australian National University, Canberra, ACT 2601, Australia
- Eric C Lai
  - Department of Developmental Biology, Sloan-Kettering Institute, New York, New York, 10065, USA
  - Louis V. Gerstner, Jr. Graduate School of Biomedical Sciences, Memorial Sloan Kettering Cancer Center, New York, New York, 10065, USA
54
Heerah K, Woillez M, Fablet R, Garren F, Martin S, De Pontual H. Coupling spectral analysis and hidden Markov models for the segmentation of behavioural patterns. Movement Ecology 2017; 5:20. [PMID: 28944062 PMCID: PMC5609058 DOI: 10.1186/s40462-017-0111-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 12/21/2016] [Accepted: 09/04/2017] [Indexed: 06/07/2023]
Abstract
BACKGROUND Movement pattern variations are reflective of behavioural switches, likely associated with different life history traits in response to the animals' abiotic and biotic environment. Detecting these can provide rich information on the underlying processes driving animal movement patterns. However, extracting these signals from movement time series requires tools that objectively extract, describe and quantify these behaviours. The inference of behavioural modes from movement patterns has been mainly addressed through hidden Markov models (HMMs). Until now, the metrics implemented in these models did not allow cyclic patterns to be characterized directly from the raw time series. To address these challenges, we developed an approach to i) extract new metrics of cyclic behaviours and activity levels from a time-frequency analysis of movement time series, and ii) implement the spectral signatures of these cyclic patterns and activity levels into an HMM framework to identify and classify latent behavioural states. RESULTS To illustrate our approach, we applied it to 40 high-resolution European sea bass depth time series. Our results showed that the fish had different activity regimes, which were also associated (or not) with the spectral signature of different environmental cycles. Tidal rhythms were observed when animals tended to be less active and dived shallower. Conversely, animals exhibited a diurnal behaviour when more active and deeper in the water column. The different behaviours were well defined and occurred at similar periods throughout the annual cycle amongst individuals, suggesting these behaviours are likely related to seasonal functional behaviours (e.g. feeding, migrating and spawning). CONCLUSIONS The innovative aspects of our method lie within the combined use of powerful, but generic, mathematical tools (spectral analysis and hidden Markov models) to extract complex behaviours from 1-D movement time series. It is fully automated, which makes it suitable for analyzing large datasets. HMMs also offer the flexibility to include any additional variable in the segmentation process (e.g. environmental features, location coordinates). Thus, our method could be widely applied in the bio-logging community and contribute to prime issues in movement ecology (e.g. habitat requirements and selection, site fidelity and dispersal) that are crucial to inform mitigation, management and conservation strategies.
Affiliation(s)
- Karine Heerah
  - Ifremer, Sciences et Technologies Halieutiques, 10070, 29280 Plouzané, CS France
- Mathieu Woillez
  - Ifremer, Sciences et Technologies Halieutiques, 10070, 29280 Plouzané, CS France
- Ronan Fablet
  - IMT Atlantique, University Bretagne Loire, 29238 Brest, France
- François Garren
  - Ifremer, Sciences et Technologies Halieutiques, 10070, 29280 Plouzané, CS France
- Stéphane Martin
  - Ifremer, Sciences et Technologies Halieutiques, 10070, 29280 Plouzané, CS France
- Hélène De Pontual
  - Ifremer, Sciences et Technologies Halieutiques, 10070, 29280 Plouzané, CS France
55
Integrative clustering of multi-level 'omic data based on non-negative matrix factorization algorithm. PLoS One 2017; 12:e0176278. [PMID: 28459819 PMCID: PMC5411077 DOI: 10.1371/journal.pone.0176278] [Citation(s) in RCA: 100] [Impact Index Per Article: 12.5] [Received: 09/13/2016] [Accepted: 04/07/2017] [Indexed: 11/30/2022]
Abstract
Integrative analyses of high-throughput 'omic data, such as DNA methylation, DNA copy number alteration, mRNA and protein expression levels, have created unprecedented opportunities to understand the molecular basis of human disease. In particular, integrative analyses have been the cornerstone in the study of cancer to determine molecular subtypes within a given cancer. As malignant tumors with similar morphological characteristics have been shown to exhibit entirely different molecular profiles, there has been significant interest in using multiple 'omic data for the identification of novel molecular subtypes of disease, which could impact treatment decisions. Therefore, we have developed intNMF, an integrative approach for disease subtype classification based on non-negative matrix factorization. The proposed approach carries out integrative clustering of multiple high-dimensional molecular data in a single comprehensive analysis, utilizing the information across multiple biological levels assessed on the same individual. As intNMF does not assume any distributional form for the data, it has obvious advantages over other model-based clustering methods, which require specific distributional assumptions. Application of intNMF is illustrated using both simulated and real data from The Cancer Genome Atlas (TCGA).
56
Shao C, Höfer T. Robust classification of single-cell transcriptome data by nonnegative matrix factorization. Bioinformatics 2016; 33:235-242. [DOI: 10.1093/bioinformatics/btw607] [Citation(s) in RCA: 76] [Impact Index Per Article: 8.4] [Received: 03/16/2016] [Revised: 09/15/2016] [Accepted: 09/16/2016] [Indexed: 11/14/2022]
57
Atomic connectomics signatures for characterization and differentiation of mild cognitive impairment. Brain Imaging Behav 2016; 9:663-77. [PMID: 25355371 DOI: 10.1007/s11682-014-9320-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Indexed: 12/13/2022]
Abstract
In recent years, functional connectomics signatures have been shown to be a very valuable tool in characterizing and differentiating brain disorders from normal controls. However, if the functional connectivity alterations in a brain disease are localized within sub-networks of a connectome, then accurate identification of such disease-specific sub-networks is critical and this capability entails both fine-granularity definition of connectome nodes and effective clustering of connectome nodes into disease-specific and non-disease-specific sub-networks. In this work, we adopted the recently developed DICCCOL (dense individualized and common connectivity-based cortical landmarks) system as a fine-granularity high-resolution connectome construction method to deal with the first issue, and employed an effective variant of non-negative matrix factorization (NMF) method to pinpoint disease-specific sub-networks, which we called atomic connectomics signatures in this work. We have implemented and applied this novel framework to two mild cognitive impairment (MCI) datasets from two different research centers, and our experimental results demonstrated that the derived atomic connectomics signatures can effectively characterize and differentiate MCI patients from their normal controls. In general, our work contributed a novel computational framework for deriving descriptive and distinctive atomic connectomics signatures in brain disorders.
58
Rosales RA, Drummond RD, Valieris R, Dias-Neto E, da Silva IT. signeR: an empirical Bayesian approach to mutational signature discovery. Bioinformatics 2016; 33:8-16. [DOI: 10.1093/bioinformatics/btw572] [Citation(s) in RCA: 72] [Impact Index Per Article: 8.0] [Received: 06/22/2016] [Revised: 08/11/2016] [Accepted: 08/26/2016] [Indexed: 11/14/2022]
59
New integrative computational approaches unveil the Saccharomyces cerevisiae pheno-metabolomic fermentative profile and allow strain selection for winemaking. Food Chem 2016; 211:509-20. [PMID: 27283661 DOI: 10.1016/j.foodchem.2016.05.080] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 12/10/2015] [Revised: 04/10/2016] [Accepted: 05/12/2016] [Indexed: 11/23/2022]
Abstract
During must fermentation by Saccharomyces cerevisiae strains, thousands of volatile aroma compounds are formed. The objective of the present work was to adapt computational approaches to analyze the pheno-metabolomic diversity of a S. cerevisiae strain collection with different origins. Phenotypic and genetic characterization together with individual must fermentations were performed, and metabolites relevant to aromatic profiles were determined. Experimental results were projected onto a common coordinate system, revealing 17 statistically relevant multi-dimensional modules, combining sets of most-correlated features of noteworthy biological importance. The present method made it possible to combine genetic, phenotypic and metabolomic data, which had not been possible so far due to difficulties in comparing different types of data. The proposed computational approach thus proved successful in shedding light on the holistic characterization of the S. cerevisiae pheno-metabolome in must fermentative conditions. This will allow the identification of combined relevant features with application in the selection of good winemaking strains.
60
Stražar M, Žitnik M, Zupan B, Ule J, Curk T. Orthogonal matrix factorization enables integrative analysis of multiple RNA binding proteins. Bioinformatics 2016; 32:1527-35. [PMID: 26787667 PMCID: PMC4894278 DOI: 10.1093/bioinformatics/btw003] [Citation(s) in RCA: 77] [Impact Index Per Article: 8.6] [Received: 07/02/2015] [Accepted: 01/01/2016] [Indexed: 12/15/2022]
Abstract
Motivation: RNA binding proteins (RBPs) play important roles in post-transcriptional control of gene expression, including splicing, transport, polyadenylation and RNA stability. To model protein–RNA interactions by considering all available sources of information, it is necessary to integrate the rapidly growing RBP experimental data with the latest genome annotation, gene function, RNA sequence and structure. Such integration is possible by matrix factorization, where current approaches have an undesired tendency to identify only a small number of the strongest patterns with overlapping features. Because protein–RNA interactions are orchestrated by multiple factors, methods that identify discriminative patterns of varying strengths are needed. Results: We have developed an integrative orthogonality-regularized nonnegative matrix factorization (iONMF) to integrate multiple data sources and discover non-overlapping, class-specific RNA binding patterns of varying strengths. The orthogonality constraint halves the effective size of the factor model and outperforms other NMF models in predicting RBP interaction sites on RNA. We have integrated the largest data compendium to date, which includes 31 CLIP experiments on 19 RBPs involved in splicing (such as hnRNPs, U2AF2, ELAVL1, TDP-43 and FUS) and processing of 3’UTR (Ago, IGF2BP). We show that the integration of multiple data sources improves the predictive accuracy of retrieval of RNA binding sites. In our study the key predictive factors of protein–RNA interactions were the position of RNA structure and sequence motifs, RBP co-binding and gene region type. We report on a number of protein-specific patterns, many of which are consistent with experimentally determined properties of RBPs. Availability and implementation: The iONMF implementation and example datasets are available at https://github.com/mstrazar/ionmf. 
Contact: tomaz.curk@fri.uni-lj.si. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Martin Stražar
  - University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, SI 1000, Slovenia
- Marinka Žitnik
  - University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, SI 1000, Slovenia
- Blaž Zupan
  - University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, SI 1000, Slovenia
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Jernej Ule
  - Department of Molecular Neuroscience, UCL Institute of Neurology, Queen Square, London WC1N 3BG, UK
- Tomaž Curk
  - University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, SI 1000, Slovenia
61
Gligorijević V, Pržulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015; 12:20150571. [PMID: 26490630 PMCID: PMC4685837 DOI: 10.1098/rsif.2015.0571] [Citation(s) in RCA: 144] [Impact Index Per Article: 14.4] [Received: 06/26/2015] [Accepted: 09/25/2015] [Indexed: 12/17/2022]
Abstract
Rapid technological advances have led to the production of different types of biological data and enabled construction of complex networks with various types of interactions between diverse biological entities. Standard network data analysis methods were shown to be limited in dealing with such heterogeneous networked data and consequently, new methods for integrative data analyses have been proposed. The integrative methods can collectively mine multiple types of biological data and produce more holistic, systems-level biological insights. We survey recent methods for collective mining (integration) of various types of networked biological data. We compare different state-of-the-art methods for data integration and highlight their advantages and disadvantages in addressing important biological problems. We identify the important computational challenges of these methods and provide a general guideline for which methods are suited for specific biological problems, or specific data types. Moreover, we propose that recent non-negative matrix factorization-based approaches may become the integration methodology of choice, as they are well suited and accurate in dealing with heterogeneous data and have many opportunities for further development.
Affiliation(s)
- Nataša Pržulj
  - Department of Computing, Imperial College London, London SW7 2AZ, UK
62
Biswas AK, Kang M, Kim DC, Ding CHQ, Zhang B, Wu X, Gao JX. Inferring disease associations of the long non-coding RNAs through non-negative matrix factorization. Network Modeling Analysis in Health Informatics and Bioinformatics 2015. [DOI: 10.1007/s13721-015-0081-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 02/01/2023]
63
Mejía-Roa E, Tabas-Madrid D, Setoain J, García C, Tirado F, Pascual-Montano A. NMF-mGPU: non-negative matrix factorization on multi-GPU systems. BMC Bioinformatics 2015; 16:43. [PMID: 25887585 PMCID: PMC4339678 DOI: 10.1186/s12859-015-0485-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 09/16/2014] [Accepted: 01/30/2015] [Indexed: 01/11/2023]
Abstract
BACKGROUND In the last few years, the Non-negative Matrix Factorization (NMF) technique has gained great interest among the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessor cluster. In this paper, we present NMF-mGPU, an efficient and easy-to-use implementation of the NMF algorithm that takes advantage of the high computing performance delivered by Graphics-Processing Units (GPUs). Driven by the ever-growing demands from the video-games industry, graphics cards usually provided in PCs and laptops have evolved from simple graphics-drawing platforms into high-performance programmable systems that can be used as coprocessors for linear-algebra operations. However, these devices may have a limited amount of on-board memory, which is not considered by other NMF implementations on GPU. RESULTS NMF-mGPU is based on CUDA (Compute Unified Device Architecture), NVIDIA's framework for GPU computing. On devices with low memory available, large input matrices are blockwise transferred from the system's main memory to the GPU's memory, and processed accordingly. In addition, NMF-mGPU has been explicitly optimized for the different CUDA architectures. Finally, platforms with multiple GPUs can be synchronized through MPI (Message Passing Interface). In a four-GPU system, this implementation is about 120 times faster than a single conventional processor, and more than four times faster than a single GPU device (i.e., a super-linear speedup). CONCLUSIONS Applications of GPUs in Bioinformatics are getting more and more attention due to their outstanding performance when compared to traditional processors. In addition, their relatively low price represents a highly cost-effective alternative to conventional clusters. In life sciences, this results in an excellent opportunity to facilitate the daily work of bioinformaticians who are trying to extract biological meaning out of hundreds of gigabytes of experimental information. NMF-mGPU can be used "out of the box" by researchers with little or no expertise in GPU programming on a variety of platforms, such as PCs, laptops, or high-end GPU clusters. NMF-mGPU is freely available at https://github.com/bioinfo-cnb/bionmf-gpu .
Affiliation(s)
- Edgardo Mejía-Roa
  - ArTeCS Group, Department of Computer Architecture, Complutense University of Madrid (UCM), Madrid, 28040, Spain
- Daniel Tabas-Madrid
  - Functional Bioinformatics Group, Biocomputing Unit, National Center for Biotechnology-CSIC, UAM, Madrid, 28049, Spain
- Javier Setoain
  - Functional Bioinformatics Group, Biocomputing Unit, National Center for Biotechnology-CSIC, UAM, Madrid, 28049, Spain
- Carlos García
  - ArTeCS Group, Department of Computer Architecture, Complutense University of Madrid (UCM), Madrid, 28040, Spain
- Francisco Tirado
  - ArTeCS Group, Department of Computer Architecture, Complutense University of Madrid (UCM), Madrid, 28040, Spain
- Alberto Pascual-Montano
  - Functional Bioinformatics Group, Biocomputing Unit, National Center for Biotechnology-CSIC, UAM, Madrid, 28049, Spain
64
Abstract
Motivation: Recently, a shift was made from using Gene Ontology (GO) to evaluate molecular network data to using these data to construct and evaluate GO. Dutkowski et al. provide the first evidence that a large part of GO can be reconstructed solely from topologies of molecular networks. Motivated by this work, we develop a novel data integration framework that integrates multiple types of molecular network data to reconstruct and update GO. We ask how much of GO can be recovered by integrating various molecular interaction data. Results: We introduce a computational framework for integration of various biological networks using penalized non-negative matrix tri-factorization (PNMTF). It takes all network data in a matrix form and performs simultaneous clustering of genes and GO terms, inducing new relations between genes and GO terms (annotations) and between GO terms themselves. To improve the accuracy of our predicted relations, we extend the integration methodology to include additional topological information represented as the similarity in wiring around non-interacting genes. Surprisingly, by integrating topologies of baker's yeast protein–protein interaction, genetic interaction (GI) and co-expression networks, our method reports as related 96% of GO terms that are directly related in GO. The inclusion of the wiring similarity of non-interacting genes contributes 6% to this large GO term association capture. Furthermore, we use our method to infer new relationships between GO terms solely from the topologies of these networks and validate 44% of our predictions in the literature. In addition, our integration method reproduces 48% of cellular component, 41% of molecular function and 41% of biological process GO terms, outperforming the previous method in the former two domains of GO. Finally, we predict new GO annotations of yeast genes and validate our predictions through GI profiling.
Availability and implementation: Supplementary Tables of new GO term associations and predicted gene annotations are available at http://bio-nets.doc.ic.ac.uk/GO-Reconstruction/. Contact: natasha@imperial.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Vuk Janjić
  - Department of Computing, Imperial College London SW7 2AZ, UK
- Nataša Pržulj
  - Department of Computing, Imperial College London SW7 2AZ, UK
65
Žitnik M, Zupan B. Data Fusion by Matrix Factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 2015; 37:41-53. [PMID: 26353207 DOI: 10.1109/tpami.2014.2343973] [Citation(s) in RCA: 102] [Impact Index Per Article: 10.2] [Indexed: 06/05/2023]
Abstract
For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about the system's constraints. In this paper we describe a data fusion approach with penalized matrix tri-factorization (DFMF) that simultaneously factorizes data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including those from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF for a gene function prediction task with eleven different data sources and for prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative data integration approaches and achieves higher accuracy than can be obtained from any single data source alone.
66
Sotiras A, Resnick SM, Davatzikos C. Finding imaging patterns of structural covariance via Non-Negative Matrix Factorization. Neuroimage 2014; 108:1-16. [PMID: 25497684 DOI: 10.1016/j.neuroimage.2014.11.045] [Citation(s) in RCA: 103] [Impact Index Per Article: 9.4] [Received: 05/20/2014] [Revised: 11/13/2014] [Accepted: 11/18/2014] [Indexed: 01/12/2023]
Abstract
In this paper, we investigate the use of Non-Negative Matrix Factorization (NNMF) for the analysis of structural neuroimaging data. The goal is to identify the brain regions that co-vary across individuals in a consistent way, hence potentially being part of underlying brain networks or otherwise influenced by underlying common mechanisms such as genetics and pathologies. NNMF offers a directly data-driven way of extracting relatively localized co-varying structural regions, thereby transcending limitations of Principal Component Analysis (PCA), Independent Component Analysis (ICA) and other related methods that tend to produce dispersed components of positive and negative loadings. In particular, leveraging the well-known ability of NNMF to produce parts-based representations of image data, we derive decompositions that partition the brain into regions that vary in consistent ways across individuals. Importantly, these decompositions achieve dimensionality reduction in highly interpretable ways and generalize well to new data, as shown via split-sample experiments. We empirically validate NNMF in two data sets: i) a Diffusion Tensor (DT) mouse brain development study, and ii) a structural Magnetic Resonance (sMR) study of human brain aging. We demonstrate the ability of NNMF to produce sparse parts-based representations of the data at various resolutions. These representations seem to follow what we know about the underlying functional organization of the brain and also capture some pathological processes. Moreover, we show that these low-dimensional representations compare favorably to descriptions obtained with more commonly used matrix factorization methods like PCA and ICA.
Affiliation(s)
- Aristeidis Sotiras
  - Section for Biomedical Image Analysis, Center for Biomedical Image Computing and Analytics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Susan M Resnick
  - Laboratory of Behavioral Neuroscience, National Institute on Aging, Baltimore, MD 21224, USA
- Christos Davatzikos
  - Section for Biomedical Image Analysis, Center for Biomedical Image Computing and Analytics, University of Pennsylvania, Philadelphia, PA 19104, USA
67
Ou J, Lian Z, Xie L, Li X, Wang P, Hao Y, Zhu D, Jiang R, Wang Y, Chen Y, Zhang J, Liu T. Atomic dynamic functional interaction patterns for characterization of ADHD. Hum Brain Mapp 2014; 35:5262-78. [PMID: 24861961 DOI: 10.1002/hbm.22548] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 12/08/2013] [Revised: 03/07/2014] [Accepted: 05/05/2014] [Indexed: 11/08/2022]
Abstract
Modeling abnormal temporal dynamics of functional interactions in psychiatric disorders has been of great interest in the neuroimaging field, and thus a variety of methods have been proposed so far. However, the temporal dynamics and disease-related abnormalities of functional interactions within specific data-driven discovered subnetworks have rarely been explored. In this work, we propose a novel computational framework composed of an effective Bayesian connectivity change point model for modeling functional brain interactions and their dynamics simultaneously and an effective variant of nonnegative matrix factorization for assessing the functional interaction abnormalities within subnetworks. This framework has been applied to the resting state functional magnetic resonance imaging (fMRI) datasets of 23 children with attention-deficit/hyperactivity disorder (ADHD) and 45 normal control (NC) children, and has revealed two atomic functional interaction patterns (AFIPs) discovered for ADHD and another two AFIPs derived for NC. Together, these four AFIPs could be grouped into two pairs, one common pair representing the common AFIPs in ADHD and NC, and the other abnormal pair representing the abnormal AFIPs in ADHD. Interestingly, by comparing the abnormal AFIP pair, two data-driven abnormal functional subnetworks are derived. Strikingly, by evaluating the approximation based on the four AFIPs, all of the ADHD children were successfully differentiated from NCs without any false positives.
Affiliation(s)
- Jinli Ou
- School of Biomedical Engineering & Instrument Science, Zhejiang University, Hangzhou, China
|
68
|
How Many Topics? Stability Analysis for Topic Models. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES 2014. [DOI: 10.1007/978-3-662-44848-9_32] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Indexed: 12/02/2022]
|
69
|
Abstract
Systemic response to DNA damage and other stresses is a complex process that includes changes in the regulation and activity of nearly all stages of gene expression. One gene regulatory mechanism used by eukaryotes is selection among alternative transcript isoforms that differ in polyadenylation [poly(A)] sites, resulting in changes either to the coding sequence or to portions of the 3' UTR that govern translation, stability, and localization. To determine the extent to which this means of regulation is used in response to DNA damage, we conducted a global analysis of poly(A) site usage in Saccharomyces cerevisiae after exposure to the UV mimetic, 4-nitroquinoline 1-oxide (4NQO). Two thousand thirty-one genes were found to have significant variation in poly(A) site distributions following 4NQO treatment, with a strong bias toward loss of short transcripts, including many with poly(A) sites located within the protein coding sequence (CDS). We further explored one possible mechanism that could contribute to the widespread differences in mRNA isoforms. The change in poly(A) site profile was associated with an inhibition of cleavage and polyadenylation in cell extract and a decrease in the levels of several key subunits in the mRNA 3'-end processing complex. Sequence analysis identified differences in the cis-acting elements that flank putatively suppressed and enhanced poly(A) sites, suggesting a mechanism that could discriminate between variable and constitutive poly(A) sites. Our analysis indicates that variation in mRNA length is an important part of the regulatory response to DNA damage.
|
70
|
Vandenbon A, Kumagai Y, Teraguchi S, Amada KM, Akira S, Standley DM. A Parzen window-based approach for the detection of locally enriched transcription factor binding sites. BMC Bioinformatics 2013; 14:26. [PMID: 23331723 PMCID: PMC3602658 DOI: 10.1186/1471-2105-14-26] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/06/2012] [Accepted: 01/14/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Identification of cis- and trans-acting factors regulating gene expression remains an important problem in biology. Bioinformatics analyses of regulatory regions are hampered by several difficulties. One is that binding sites for regulatory proteins are often not significantly over-represented in the set of DNA sequences of interest, because of high levels of false positive predictions, and because of positional restrictions on functional binding sites with regard to the transcription start site. RESULTS We have developed a novel method for the detection of regulatory motifs based on their local over-representation in sets of regulatory regions. The method makes use of a Parzen window-based approach for scoring local enrichment, and during evaluation of significance it takes into account GC content of sequences. We show that the accuracy of our method compares favourably to that of other methods, and that our method is capable of detecting not only generally over-represented regulatory motifs, but also locally over-represented motifs that are often missed by standard motif detection approaches. Using a number of examples we illustrate the validity of our approach and suggest applications, such as the analysis of weaker binding sites. CONCLUSIONS Our approach can be used to suggest testable hypotheses for wet-lab experiments. It has potential for future analyses, such as the prediction of weaker binding sites. An online application of our approach, called LocaMo Finder (Local Motif Finder), is available at http://sysimm.ifrec.osaka-u.ac.jp/tfbs/locamo/.
Affiliation(s)
- Alexis Vandenbon
- Laboratory of Systems Immunology, Immunology Frontier Research Center, Osaka University, Osaka, Japan.
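The local-enrichment idea in the abstract above can be illustrated with a small sketch. This is not the LocaMo Finder implementation: the Gaussian kernel, the `bandwidth` value, the function name `parzen_local_enrichment`, and the synthetic hit positions are all assumptions made for illustration, and the GC-content-aware significance test mentioned in the abstract is not reproduced.

```python
import numpy as np

def parzen_local_enrichment(positions, grid, bandwidth=20.0):
    """Gaussian Parzen-window density of motif hit positions, evaluated on a grid.

    positions: motif hit coordinates relative to the transcription start site.
    grid: coordinates at which the density is evaluated.
    bandwidth: kernel standard deviation in nucleotides (assumed value).
    """
    positions = np.asarray(positions, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance between every grid point and every motif hit.
    diffs = (grid[:, None] - positions[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # Averaging the kernels over all hits gives the density estimate.
    return kernel.mean(axis=1)

# Synthetic (hypothetical) data: a motif clustered around -50 nt plus scattered hits.
rng = np.random.default_rng(0)
hits = np.concatenate([rng.normal(-50, 10, 80), rng.uniform(-500, 100, 40)])
background = rng.uniform(-500, 100, 120)

grid = np.arange(-500, 101, 10)
fg = parzen_local_enrichment(hits, grid)
bg = parzen_local_enrichment(background, grid)

# Local enrichment score: foreground density relative to background density.
ratio = fg / (bg + 1e-12)
peak = grid[np.argmax(fg)]
print("foreground density peaks near position", peak)
```

A motif that is not over-represented across the whole region can still stand out in `ratio` within a narrow window, which is the kind of signal the abstract describes standard global approaches as missing.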
|
71
|
Tian B, Graber JH. Signals for pre-mRNA cleavage and polyadenylation. WILEY INTERDISCIPLINARY REVIEWS-RNA 2011; 3:385-96. [PMID: 22012871 DOI: 10.1002/wrna.116] [Citation(s) in RCA: 172] [Impact Index Per Article: 12.3] [Indexed: 01/10/2023]
Abstract
Pre-mRNA cleavage and polyadenylation is an essential step for 3' end formation of almost all protein-coding transcripts in eukaryotes. The reaction, involving cleavage of nascent mRNA followed by addition of a polyadenylate or poly(A) tail, is controlled by cis-acting elements in the pre-mRNA surrounding the cleavage site. Experimental and bioinformatic studies in the past three decades have elucidated conserved and divergent elements across eukaryotes, from yeast to human. Here we review histories and current models of these elements in a broad range of species.
Affiliation(s)
- Bin Tian
- UMDNJ-New Jersey Medical School, Newark, NJ, USA.
|
72
|
Gaujoux R, Seoighe C. Semi-supervised Nonnegative Matrix Factorization for gene expression deconvolution: a case study. INFECTION GENETICS AND EVOLUTION 2011; 12:913-21. [PMID: 21930246 DOI: 10.1016/j.meegid.2011.08.014] [Citation(s) in RCA: 87] [Impact Index Per Article: 6.2] [Received: 06/03/2011] [Revised: 08/10/2011] [Accepted: 08/11/2011] [Indexed: 10/17/2022]
Abstract
Heterogeneity in sample composition is an inherent issue in many gene expression studies and, in many cases, should be taken into account in the downstream analysis to enable correct interpretation of the underlying biological processes. Typical examples are infectious diseases or immunology-related studies using blood samples, where, for example, the proportions of lymphocyte sub-populations are expected to vary between cases and controls. Nonnegative Matrix Factorization (NMF) is an unsupervised learning technique that has been applied successfully in several fields, notably in bioinformatics where its ability to extract meaningful information from high-dimensional data such as gene expression microarrays has been demonstrated. Very recently, it has been applied to biomarker discovery and gene expression deconvolution in heterogeneous tissue samples. Being essentially unsupervised, standard NMF methods are not guaranteed to find components corresponding to the cell types of interest in the sample, which may jeopardize the correct estimation of cell proportions. We have investigated the use of prior knowledge, in the form of a set of marker genes, to improve gene expression deconvolution with NMF algorithms. We found that this improves the consistency with which both cell type proportions and cell type gene expression signatures are estimated. The proposed method was tested on a microarray dataset consisting of pure cell types mixed in known proportions. Pearson correlation coefficients between true and estimated cell type proportions improved substantially (typically from about 0.5 to approximately 0.8) with the semi-supervised (marker-guided) versions of commonly used NMF algorithms. Furthermore, known marker genes associated with each cell type were assigned to the correct cell type more frequently for the guided versions. We conclude that the use of marker genes improves the accuracy of gene expression deconvolution using NMF and suggest modifications to how the marker gene information is used that may lead to further improvements.
Affiliation(s)
- Renaud Gaujoux
- Computational Biology Group, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, South Africa.
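To make the marker-guided idea in the abstract above concrete, here is a minimal sketch under stated assumptions. The abstract does not say how the marker information is enforced; this sketch assumes that each marker gene is expressed only in its own cell type and enforces that by zeroing the other entries of the signature matrix before running standard multiplicative updates for the Frobenius objective. The function name `marker_guided_nmf`, the synthetic data, and all parameter values are hypothetical.

```python
import numpy as np

def marker_guided_nmf(X, markers, n_types, n_iter=200, seed=0, eps=1e-9):
    """Factor X (genes x samples) as W (genes x types) @ H (types x samples).

    markers: dict mapping a cell-type index to a list of marker gene row indices.
    Assumption: a marker gene has zero expression in every other cell type.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    W = rng.uniform(0.1, 1.0, (n_genes, n_types))
    H = rng.uniform(0.1, 1.0, (n_types, n_samples))

    # Build a 0/1 mask that keeps a marker row only in its own column.
    mask = np.ones((n_genes, n_types))
    for k, genes in markers.items():
        mask[genes, :] = 0.0
        mask[genes, k] = 1.0
    W *= mask

    for _ in range(n_iter):
        # Multiplicative updates keep entries nonnegative, and an entry
        # that starts at zero stays at zero, so the marker mask persists.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic (hypothetical) mixture: 3 cell types, 2 marker genes each.
rng = np.random.default_rng(1)
n_genes, n_types, n_samples = 30, 3, 12
markers = {0: [0, 1], 1: [2, 3], 2: [4, 5]}
W_true = rng.uniform(0.5, 2.0, (n_genes, n_types))
for k, genes in markers.items():
    others = [j for j in range(n_types) if j != k]
    W_true[np.ix_(genes, others)] = 0.0
H_true = rng.dirichlet(np.ones(n_types), size=n_samples).T
X = W_true @ H_true

W, H = marker_guided_nmf(X, markers, n_types)
# Rescale each sample's coefficients so they read as cell-type proportions.
proportions = H / (H.sum(axis=0, keepdims=True) + 1e-12)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print("relative reconstruction error:", rel_err)
```

Because the zero pattern ties each component to a known cell type, the rows of `proportions` can be compared directly with known mixing proportions, for instance via the Pearson correlation used in the abstract.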
|
73
|
Ho ES, Gunderson SI. Long conserved fragments upstream of Mammalian polyadenylation sites. Genome Biol Evol 2011; 3:654-66. [PMID: 21705472 PMCID: PMC3157836 DOI: 10.1093/gbe/evr053] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Accepted: 05/27/2011] [Indexed: 12/02/2022] Open
Abstract
Polyadenylation is a cotranscriptional nuclear RNA processing event involving endonucleolytic cleavage of the nascent, emerging pre-messenger RNA (pre-mRNA) from the RNA polymerase, immediately followed by the polymerization of adenine ribonucleotides, called the poly(A) tail, to the cleaved 3' end of the polyadenylation site (PAS). This apparently simple molecular processing step has been discovered to be connected to transcription and splicing, therefore increasing its potential for regulation of gene expression. Here, through a bioinformatic analysis of cis-PAS-regulatory elements in mammals that includes taking advantage of multiple evolutionary time scales, we find unexpected selection pressure much further upstream, up to 200 nt, from the PAS than previously thought. Strikingly, close to 3,000 long (30-500 nt) noncoding conserved fragments (CFs) were discovered in the PAS flanking region of three remotely related mammalian species: human, mouse, and cow. When an even more remote transitional mammal, platypus, was included, still over a thousand CFs were found in the proximity of the PAS. Even though the biological function of these CFs remains unknown, their considerable size makes them unlikely to serve as protein recognition sites, which are typically ≤15 nt. By harnessing genome-wide DNaseI hypersensitivity data, we have discovered that the presence of CFs correlates with chromatin accessibility. Our study is important in highlighting novel experimental targets, which may provide new understanding about the regulatory aspects of polyadenylation.
Affiliation(s)
- Eric S. Ho
- Present address: Department of Molecular Genetics, Microbiology and Immunology, University of Medicine and Dentistry of New Jersey-Robert Wood Johnson Medical School, Piscataway, New Jersey
|
74
|
Mishra H, Singh N, Misra K, Lahiri T. An ANN-GA model based promoter prediction in Arabidopsis thaliana using tilling microarray data. Bioinformation 2011; 6:240-3. [PMID: 21887014 PMCID: PMC3159145 DOI: 10.6026/97320630006240] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 02/10/2011] [Accepted: 05/09/2011] [Indexed: 11/23/2022] Open
Abstract
Identification of promoter regions is an important part of gene annotation. Identification of promoters in eukaryotes is important as promoters modulate various metabolic functions and cellular stress responses. In this work, a novel approach utilizing intensity values of tilling microarray data for a model eukaryotic plant, Arabidopsis thaliana, was used to distinguish promoter regions from non-promoter regions. A feed-forward back propagation neural network model supported by a genetic algorithm was employed to predict the class of data with a window size of 41. A dataset comprising 2992 data vectors representing both promoter and non-promoter regions, chosen randomly from probe intensity vectors for the whole genome of Arabidopsis thaliana generated through the tilling microarray technique, was used. The classifier model shows a prediction accuracy of 69.73% and 65.36% on training and validation sets, respectively. Further, a concept of distance-based class membership was used to validate the reliability of the classifier, which showed promising results. The study shows the usability of microarray probe intensities to predict the promoter regions in eukaryotic genomes.
Affiliation(s)
- Hrishikesh Mishra
- Division of Applied Sciences and Indo-Russian Centre for Biotechnology, Indian Institute of Information Technology, Allahabad, India
|
75
|
Gaujoux R, Seoighe C. A flexible R package for nonnegative matrix factorization. BMC Bioinformatics 2010; 11:367. [PMID: 20598126 PMCID: PMC2912887 DOI: 10.1186/1471-2105-11-367] [Citation(s) in RCA: 950] [Impact Index Per Article: 63.3] [Received: 11/16/2009] [Accepted: 07/02/2010] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Nonnegative Matrix Factorization (NMF) is an unsupervised learning technique that has been applied successfully in several fields, including signal processing, face recognition and text mining. Recent applications of NMF in bioinformatics have demonstrated its ability to extract meaningful information from high-dimensional data such as gene expression microarrays. Developments in NMF theory and applications have resulted in a variety of algorithms and methods. However, most NMF implementations have been on commercial platforms, while those that are freely available typically require programming skills. This limits their use by the wider research community. RESULTS Our objective is to provide the bioinformatics community with an open-source, easy-to-use and unified interface to standard NMF algorithms, as well as with a simple framework to help implement and test new NMF methods. For that purpose, we have developed a package for the R/BioConductor platform. The package ports public code to R, and is structured to enable users to easily modify and/or add algorithms. It includes a number of published NMF algorithms and initialization methods and facilitates the combination of these to produce new NMF strategies. Commonly used benchmark data and visualization methods are provided to help in the comparison and interpretation of the results. CONCLUSIONS The NMF package helps realize the potential of Nonnegative Matrix Factorization, especially in bioinformatics, providing easy access to methods that have already yielded new insights in many applications. Documentation, source code and sample data are available from CRAN.
Affiliation(s)
- Renaud Gaujoux
- Computational Biology Group, Department of Clinical Laboratory Sciences, Faculty of Health Sciences, University of Cape Town, South Africa
|