1
Chen W, Cui Y, He Y, Zhao L, Cui R, Liu X, Huang H, Zhang Y, Fan Y, Feng X, Ni K, Jiang T, Han M, Lei Y, Liu M, Meng Y, Chen X, Lu X, Wang D, Wang J, Wang S, Guo L, Chen Q, Ye W. Raffinose degradation-related gene GhAGAL3 was screened out responding to salinity stress through expression patterns of GhAGALs family genes. Frontiers in Plant Science 2023; 14:1246677. [PMID: 38192697 PMCID: PMC10773686 DOI: 10.3389/fpls.2023.1246677] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2023] [Accepted: 11/27/2023] [Indexed: 01/10/2024]
Abstract
α-Galactosidases (AGALs), the catabolic genes of the raffinose family oligosaccharides (RFOs), play crucial roles in plant growth and development and in responses to adverse stress. They hydrolyze the non-reducing terminal galactose residues of glycolipids and sugar chains. In this study, a genome-wide analysis of AGALs was performed. Bioinformatics analysis was used to characterize members of the AGAL family in Gossypium hirsutum, Gossypium arboreum, Gossypium barbadense, and Gossypium raimondii. RT-qPCR was carried out to analyze the expression patterns of AGAL family members in different tissues of upland cotton. A series of environmental factors was found to stimulate expression of the GhAGAL3 gene. The function of GhAGAL3 was verified through virus-induced gene silencing (VIGS). Silencing GhAGAL3 resulted in milder wilting of seedlings than in the controls and a significant increase in raffinose content in cotton, indicating that GhAGAL3 responds to NaCl stress. The increase in raffinose content improved the tolerance of cotton. These findings lay an important foundation for further research on the role of the GhAGAL3 gene family in the molecular mechanism of abiotic stress resistance in cotton.
Affiliation(s)
- Wenhua Chen
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Engineering Research Centre of Cotton, Ministry of Education/College of Agriculture, Xinjiang Agricultural University, Urumqi, China
- Yupeng Cui
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Yunxin He
- Hunan Institute of Cotton Science, Changde, Hunan, China
- Lanjie Zhao
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Ruifeng Cui
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Xiaoyu Liu
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Hui Huang
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Yuexin Zhang
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Yapeng Fan
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Xixian Feng
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Kesong Ni
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Tiantian Jiang
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Mingge Han
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Yuqian Lei
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Mengyue Liu
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Yuan Meng
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Xiugui Chen
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Xuke Lu
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Delong Wang
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Junjuan Wang
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Shuai Wang
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Lixue Guo
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Quanjia Chen
- Engineering Research Centre of Cotton, Ministry of Education/College of Agriculture, Xinjiang Agricultural University, Urumqi, China
- Wuwei Ye
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences/Research Base, Anyang Institute of Technology, National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Anyang, Henan, China
- Engineering Research Centre of Cotton, Ministry of Education/College of Agriculture, Xinjiang Agricultural University, Urumqi, China
2
Procopio A, Cesarelli G, Donisi L, Merola A, Amato F, Cosentino C. Combined mechanistic modeling and machine-learning approaches in systems biology - A systematic literature review. Computer Methods and Programs in Biomedicine 2023; 240:107681. [PMID: 37385142 DOI: 10.1016/j.cmpb.2023.107681] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 05/17/2023] [Revised: 06/14/2023] [Accepted: 06/14/2023] [Indexed: 07/01/2023]
Abstract
BACKGROUND AND OBJECTIVE: Mechanistic-based model simulations (MMs) are an effective approach commonly employed, for research and learning purposes, to investigate and understand the inherent behavior of biological systems. Recent technological advances and the wide availability of omics data have allowed the application of machine learning (ML) techniques to different research fields, including systems biology. However, the limited availability of information about the biological context under analysis, the need for sufficient experimental data, and the degree of computational complexity are among the issues that MMs and ML techniques can each present when used individually. For this reason, several recent studies suggest overcoming or significantly reducing these drawbacks by combining the two methods. Given the growing interest in this hybrid approach, the present review systematically investigates the studies in the scientific literature in which MMs and ML have been combined to explain biological processes at the genomics, proteomics, and metabolomics levels, or the behavior of entire cellular populations. METHODS: The Elsevier Scopus®, Clarivate Web of Science™ and National Library of Medicine PubMed® databases were searched using the queries reported in Table 1, resulting in 350 scientific articles. RESULTS: Only 14 of the 350 documents returned by the comprehensive search of the three major online databases met our search criteria, i.e., presented a hybrid approach consisting of the synergistic combination of MMs and ML to treat a particular aspect of systems biology. CONCLUSIONS: Although interest in this methodology is recent, a careful analysis of the selected papers showed that examples of integration between MMs and ML are already present in systems biology, highlighting the great potential of this hybrid approach at both micro and macro biological scales.
Affiliation(s)
- Anna Procopio
- Department of Experimental and Clinical Medicine, Università degli Studi Magna Græcia, Catanzaro, 88100, Italy
- Giuseppe Cesarelli
- Department of Electrical Engineering and Information Technology, Università degli Studi di Napoli Federico II, Napoli, 80125, Italy
- Leandro Donisi
- Department of Advanced Medical and Surgical Sciences, Università della Campania Luigi Vanvitelli, Napoli, 80138, Italy
- Alessio Merola
- Department of Experimental and Clinical Medicine, Università degli Studi Magna Græcia, Catanzaro, 88100, Italy
- Francesco Amato
- Department of Electrical Engineering and Information Technology, Università degli Studi di Napoli Federico II, Napoli, 80125, Italy
- Carlo Cosentino
- Department of Experimental and Clinical Medicine, Università degli Studi Magna Græcia, Catanzaro, 88100, Italy
3
Du J, Wang C, Wang L, Mao S, Zhu B, Li Z, Fan X. Automatic block-wise genotype-phenotype association detection based on hidden Markov model. BMC Bioinformatics 2023; 24:138. [PMID: 37029361 PMCID: PMC10082540 DOI: 10.1186/s12859-023-05265-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2022] [Accepted: 03/31/2023] [Indexed: 04/09/2023]
Abstract
BACKGROUND: For detecting genotype-phenotype association from case-control single nucleotide polymorphism (SNP) data, one class of methods relies on testing each genomic variant site individually. However, this approach ignores the tendency for associated variant sites to be spatially clustered instead of uniformly distributed along the genome. Therefore, a more recent class of methods looks for blocks of influential variant sites. Unfortunately, existing methods of this kind either assume prior knowledge of the blocks or rely on ad hoc moving windows. A principled method is needed to automatically detect genomic variant blocks that are associated with the phenotype. RESULTS: In this paper, we introduce an automatic block-wise Genome-Wide Association Study (GWAS) method based on a hidden Markov model. Using case-control SNP data as input, our method detects the number of blocks associated with the phenotype and the locations of the blocks. Correspondingly, the minor allele of each variant site is classified as having negative influence, no influence or positive influence on the phenotype. We evaluated our method using both datasets simulated from our model and datasets from a block model different from ours, and compared its performance with other methods. These included simple methods based on Fisher's exact test, applied site-by-site, as well as more complex methods built into the recent Zoom-Focus Algorithm. Across all simulations, our method consistently outperformed the comparison methods. CONCLUSIONS: Given its demonstrated better performance, we expect that our algorithm for detecting influential variant sites may help find more accurate signals across a wide range of case-control GWAS.
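As a point of reference for the site-by-site baseline mentioned in this abstract, the following is a minimal sketch of a two-sided Fisher's exact test applied independently to each variant site. It is not the authors' block-wise method. The function names and the example counts are hypothetical, and each site is assumed to be summarized as a 2x2 table of minor and major allele counts in cases and controls.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    col2 = n - col1

    def weight(k):
        # Unnormalised hypergeometric probability of k in the top-left cell.
        return comb(col1, k) * comb(col2, row1 - k)

    lo, hi = max(0, row1 - col2), min(row1, col1)
    observed = weight(a)
    # Sum every table that is no more likely than the observed one;
    # integer weights avoid floating-point ties.
    extreme = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return extreme / comb(n, row1)


def scan_sites(sites):
    """Test each site independently, ignoring any spatial clustering."""
    return [fisher_exact_two_sided(*counts) for counts in sites]


# Hypothetical counts per site:
# (case minor, case major, control minor, control major)
sites = [(8, 2, 1, 5), (5, 5, 5, 5)]
p_values = scan_sites(sites)
```

Because every site is tested in isolation, neighbouring sites contribute nothing to each other's p-values, which is the limitation the block-wise approach is meant to address.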
Affiliation(s)
- Jin Du
- Department of Statistics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Chaojie Wang
- School of Mathematical Science, Jiangsu University, Zhenjiang, Jiangsu Province, China
- Lijun Wang
- Department of Statistics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Shanjun Mao
- College of Finance and Statistics, Hunan University, Changsha, Hunan Province, China
- Bencong Zhu
- Department of Statistics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Zheng Li
- Department of Surgery, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Xiaodan Fan
- Department of Statistics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
4
Ahmed YW, Alemu BA, Bekele SA, Gizaw ST, Zerihun MF, Wabalo EK, Teklemariam MD, Mihrete TK, Hanurry EY, Amogne TG, Gebrehiwot AD, Berga TN, Haile EA, Edo DO, Alemu BD. Epigenetic tumor heterogeneity in the era of single-cell profiling with nanopore sequencing. Clin Epigenetics 2022; 14:107. [PMID: 36030244 PMCID: PMC9419648 DOI: 10.1186/s13148-022-01323-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/27/2022] [Accepted: 08/12/2022] [Indexed: 11/29/2022]
Abstract
Nanopore sequencing has brought sequencing technology into its next generation. This has been achieved through research on improving pore efficiency, creating mechanisms to control DNA translocation, enhancing the signal-to-noise ratio, and extending read length. Epigenetic heterogeneity is broad, and mutations in the epigenome pose new challenges in cancer research. Epigenetic enzymes that catalyze DNA methylation and histone modification are dysregulated in cancer cells and cause numerous heterogeneous clones to evolve. Detecting the heterogeneity among these clones plays an indispensable role in the treatment of various cancer types. Combined with single-cell profiling, nanopore sequencing could provide simple long-read sequencing and is expected to be used soon at the bedside or in the doctor's office. Here, we review the advancements of nanopore sequencing and its use in the detection of epigenetic heterogeneity in cancer.
Affiliation(s)
- Yohannis Wondwosen Ahmed
- Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia.
- Berhan Ababaw Alemu
- Department of Medical Biochemistry, School of Medicine, St. Paul's Hospital, Millennium Medical College, Addis Ababa, Ethiopia
- Sisay Addisu Bekele
- Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Solomon Tebeje Gizaw
- Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Muluken Fekadie Zerihun
- Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Endriyas Kelta Wabalo
- Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Maria Degef Teklemariam
- Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Tsehayneh Kelemu Mihrete
- Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Endris Yibru Hanurry
- Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Tensae Gebru Amogne
- Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Assaye Desalegne Gebrehiwot
- Department of Medical Anatomy, School of Medicine, College of Health Sciences, Addis Ababa University, Addis Ababa, Ethiopia
- Tamirat Nida Berga
- Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Ebsitu Abate Haile
- Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Dessiet Oma Edo
- Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Bizuwork Derebew Alemu
- Department of Statistics, College of Natural and Computational Sciences, Mizan Tepi University, Tepi, Ethiopia
5
Yuan Z, Yang H, Pan L, Zhao W, Liang L, Gatera A, Tucker MR, Xu D. Systematic identification and expression profiles of the BAHD superfamily acyltransferases in barley (Hordeum vulgare). Sci Rep 2022; 12:5063. [PMID: 35332203 PMCID: PMC8948222 DOI: 10.1038/s41598-022-08983-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/23/2021] [Accepted: 03/14/2022] [Indexed: 12/28/2022]
Abstract
BAHD superfamily acyltransferases play an important role in catalyzing and regulating secondary metabolism in plants. Despite this, there is relatively little information regarding the BAHD superfamily in barley. In this study, we identified 116 HvBAHD acyltransferases from the barley genome. Based on phylogenetic analysis and classification in model monocotyledonous and dicotyledonous plants, we divided the genes into eight groups: I-a, I-b, II, III-a, III-b, IV, V-a and V-b. The Clade IV genes, including Agmatine Coumaroyl Transferase (ACT), which is associated with resistance of plants to Gibberella fungi, were absent in Arabidopsis. Cis-regulatory element analysis of the HvBAHDs showed that the genes respond positively to GA3 treatment. In silico expression and qPCR analysis showed that the HvBAHD genes are expressed in a range of tissues and developmental stages and are highly enriched in the seedling stage, consistent with diverse roles. Single nucleotide polymorphism (SNP) scanning analysis revealed that natural variation in the coding regions of the HvBAHDs is low and that the sequences have been conserved during barley domestication. Our results reveal the complexity of the HvBAHDs and will help facilitate their analysis in further studies.
Affiliation(s)
- Zhen Yuan
- School of Agronomy, Anhui Agricultural University, Hefei, 230036, China
- Hongliang Yang
- School of Agronomy, Anhui Agricultural University, Hefei, 230036, China
- Leiwen Pan
- School of Agronomy, Anhui Agricultural University, Hefei, 230036, China
- Wenhui Zhao
- School of Agronomy, Anhui Agricultural University, Hefei, 230036, China
- Lunping Liang
- School of Agronomy, Anhui Agricultural University, Hefei, 230036, China
- Anicet Gatera
- School of Agronomy, Anhui Agricultural University, Hefei, 230036, China
- Matthew R Tucker
- School of Agriculture, Food and Wine, Waite Research Institute, University of Adelaide, Adelaide, SA, 5064, Australia
- Dawei Xu
- School of Agronomy, Anhui Agricultural University, Hefei, 230036, China
6
ASRmiRNA: Abiotic Stress-Responsive miRNA Prediction in Plants by Using Machine Learning Algorithms with Pseudo K-Tuple Nucleotide Compositional Features. Int J Mol Sci 2022; 23:ijms23031612. [PMID: 35163534 PMCID: PMC8835813 DOI: 10.3390/ijms23031612] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/29/2021] [Revised: 01/23/2022] [Accepted: 01/26/2022] [Indexed: 02/04/2023]
Abstract
MicroRNAs (miRNAs) play a significant role in plant responses to different abiotic stresses. Identification of abiotic stress-responsive miRNAs is therefore of great importance in crop breeding programmes that aim to develop cultivars resistant to abiotic stresses. In this study, we developed a machine learning-based computational method for prediction of miRNAs associated with abiotic stresses. Three types of datasets were used for prediction, i.e., miRNA, Pre-miRNA, and Pre-miRNA + miRNA. Pseudo K-tuple nucleotide compositional features were generated for each sequence to transform the sequence data into numeric feature vectors. A support vector machine (SVM) was employed for prediction. The area under the receiver operating characteristic curve (auROC) was 70.21%, 69.71% and 77.94%, and the area under the precision-recall curve (auPRC) was 69.96%, 65.64% and 77.32%, for the miRNA, Pre-miRNA, and Pre-miRNA + miRNA datasets, respectively. Overall prediction accuracies on the independent test set were 62.33%, 64.85% and 69.21%, respectively, for the three datasets. The SVM also achieved higher accuracy than other learning methods such as random forest, extreme gradient boosting, and adaptive boosting. To make the method easy to use, an online prediction server, “ASRmiRNA”, has been developed. The proposed approach is believed to supplement existing efforts to identify abiotic stress-responsive miRNAs and Pre-miRNAs.
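To illustrate how a sequence can be turned into a numeric feature vector of the kind this abstract describes, here is a minimal sketch of plain K-tuple nucleotide composition. It is a simplification and not the ASRmiRNA implementation: the "pseudo" components, which the abstract does not detail, are omitted, and the function name, the RNA alphabet `ACGU`, and the example sequence are assumptions.

```python
from itertools import product


def ktuple_composition(seq, k=2, alphabet="ACGU"):
    """Return the normalised frequency of every k-tuple.

    Tuples are ordered as itertools.product enumerates them over `alphabet`,
    so the vector always has len(alphabet) ** k entries.
    """
    tuples = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(tuples, 0)
    windows = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k].upper()
        if word in counts:  # skip windows containing ambiguous symbols such as N
            counts[word] += 1
            windows += 1
    return [counts[t] / windows if windows else 0.0 for t in tuples]


# Hypothetical miRNA fragment; its windows are AA, AC, CG and GU.
features = ktuple_composition("AACGU", k=2)
```

A vector like `features` has a fixed length regardless of sequence length, which is what allows sequences of different sizes to be passed to a classifier such as an SVM.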
7
Nowak S, Rosin M, Stuerzlinger W, Bartram L. Visual Analytics: A Method to Explore Natural Histories of Oral Epithelial Dysplasia. Frontiers in Oral Health 2022; 2:703874. [PMID: 35048041 PMCID: PMC8757761 DOI: 10.3389/froh.2021.703874] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2021] [Accepted: 07/02/2021] [Indexed: 11/17/2022]
Abstract
Risk assessment and follow-up of oral potentially malignant disorders in patients with mild or moderate oral epithelial dysplasia is an ongoing challenge for improved oral cancer prevention. Part of the challenge is a lack of understanding of how observable features of such dysplasia, gathered as data by clinicians during follow-up, relate to underlying biological processes driving progression. Current research is at an exploratory phase where the precise questions to ask are not known. While traditional statistical and the newer machine learning and artificial intelligence methods are effective in well-defined problem spaces with large datasets, these are not the circumstances we face currently. We argue that the field is in need of exploratory methods that can better integrate clinical and scientific knowledge into analysis to iteratively generate viable hypotheses. In this perspective, we propose that visual analytics presents a set of methods well-suited to these needs. We illustrate how visual analytics excels at generating viable research hypotheses by describing our experiences using visual analytics to explore temporal shifts in the clinical presentation of epithelial dysplasia. Visual analytics complements existing methods and fulfills a critical and at-present neglected need in the formative stages of inquiry we are facing.
Affiliation(s)
- Stan Nowak
- School of Interactive Arts and Technology, Simon Fraser University, Burnaby, BC, Canada
- Miriam Rosin
- BC Oral Cancer Prevention Program, Cancer Control Research, BC Cancer, Vancouver, BC, Canada
- Department of Biomedical Physiology and Kinesiology, Simon Fraser University, Burnaby, BC, Canada
- Wolfgang Stuerzlinger
- School of Interactive Arts and Technology, Simon Fraser University, Burnaby, BC, Canada
- Lyn Bartram
- School of Interactive Arts and Technology, Simon Fraser University, Burnaby, BC, Canada
8
Domínguez-Santos R, Pérez-Cobas AE, Cuti P, Pérez-Brocal V, García-Ferris C, Moya A, Latorre A, Gil R. Interkingdom Gut Microbiome and Resistome of the Cockroach Blattella germanica. mSystems 2021; 6:6/3/e01213-20. [PMID: 33975971 PMCID: PMC8125077 DOI: 10.1128/msystems.01213-20] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 02/06/2023]
Abstract
Cockroaches are intriguing animals with two coexisting symbiotic systems, an endosymbiont in the fat body, involved in nitrogen metabolism, and a gut microbiome whose diversity, complexity, role, and developmental dynamics have not been fully elucidated. In this work, we present a metagenomic approach to study Blattella germanica populations not treated, treated with kanamycin, and recovered after treatment, both naturally and by adding feces to the diet, with the aim of better understanding the structure and function of its gut microbiome along the development as well as the characterization of its resistome. IMPORTANCE: For the first time, we analyze the interkingdom hindgut microbiome of this species, including bacteria, fungi, archaea, and viruses. Network analysis reveals putative cooperation between core bacteria that could be key for ecosystem equilibrium. We also show how antibiotic treatments alter microbiota diversity and function, while both features are restored after one untreated generation. Combining data from B. germanica treated with three antibiotics, we have characterized this species' resistome. It includes genes involved in resistance to several broad-spectrum antibiotics frequently used in the clinic. The presence of genetic elements involved in DNA mobilization indicates that they can be transferred among microbiota partners. Therefore, cockroaches can be considered reservoirs of antibiotic resistance genes (ARGs) and potential transmission vectors.
Affiliation(s)
- Rebeca Domínguez-Santos
- Institute for Integrative Systems Biology (ISysBio), University of Valencia and CSIC, Valencia, Spain
- Paolo Cuti
- Institute for Integrative Systems Biology (ISysBio), University of Valencia and CSIC, Valencia, Spain
- Vicente Pérez-Brocal
- Genomics and Health Area, Foundation for the Promotion of Sanitary and Biomedical Research (FISABIO), Valencia, Spain
- Biomedical Research Center Network of Epidemiology and Public Health (CIBEResp), Madrid, Spain
- Carlos García-Ferris
- Institute for Integrative Systems Biology (ISysBio), University of Valencia and CSIC, Valencia, Spain
- Department of Biochemistry and Molecular Biology, University of Valencia, Valencia, Spain
- Andrés Moya
- Institute for Integrative Systems Biology (ISysBio), University of Valencia and CSIC, Valencia, Spain
- Genomics and Health Area, Foundation for the Promotion of Sanitary and Biomedical Research (FISABIO), Valencia, Spain
- Biomedical Research Center Network of Epidemiology and Public Health (CIBEResp), Madrid, Spain
- Amparo Latorre
- Institute for Integrative Systems Biology (ISysBio), University of Valencia and CSIC, Valencia, Spain
- Genomics and Health Area, Foundation for the Promotion of Sanitary and Biomedical Research (FISABIO), Valencia, Spain
- Biomedical Research Center Network of Epidemiology and Public Health (CIBEResp), Madrid, Spain
- Rosario Gil
- Institute for Integrative Systems Biology (ISysBio), University of Valencia and CSIC, Valencia, Spain
- Genomics and Health Area, Foundation for the Promotion of Sanitary and Biomedical Research (FISABIO), Valencia, Spain
9
McClintock BT, Langrock R, Gimenez O, Cam E, Borchers DL, Glennie R, Patterson TA. Uncovering ecological state dynamics with hidden Markov models. Ecol Lett 2020; 23:1878-1903. [PMID: 33073921 PMCID: PMC7702077 DOI: 10.1111/ele.13610] [Citation(s) in RCA: 56] [Impact Index Per Article: 14.0] [Received: 07/16/2020] [Revised: 08/13/2020] [Accepted: 08/25/2020] [Indexed: 01/03/2023]
Abstract
Ecological systems can often be characterised by changes among a finite set of underlying states pertaining to individuals, populations, communities or entire ecosystems through time. Owing to the inherent difficulty of empirical field studies, ecological state dynamics operating at any level of this hierarchy can often be unobservable or 'hidden'. Ecologists must therefore often contend with incomplete or indirect observations that are somehow related to these underlying processes. By formally disentangling state and observation processes based on simple yet powerful mathematical properties that can be used to describe many ecological phenomena, hidden Markov models (HMMs) can facilitate inferences about complex system state dynamics that might otherwise be intractable. However, HMMs have only recently begun to gain traction within the broader ecological community. We provide a gentle introduction to HMMs, establish some common terminology, review the immense scope of HMMs for applied ecological research and provide a tutorial on implementation and interpretation. By illustrating how practitioners can use a simple conceptual template to customise HMMs for their specific systems of interest, revealing methodological links between existing applications, and highlighting some practical considerations and limitations of these approaches, our goal is to help establish HMMs as a fundamental inferential tool for ecologists.
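The separation of a hidden state process from an observation process, as described in this abstract, can be illustrated with the standard forward algorithm for a discrete HMM. The sketch below is generic and is not taken from the paper's tutorial. The two states, the two observation categories, and all probabilities are hypothetical values chosen only for illustration.

```python
def forward_likelihood(init, trans, emit, obs):
    """Likelihood of an observation sequence under a discrete HMM.

    init[i]     : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability that state i produces observation o
    """
    n_states = len(init)
    # alpha[i] = P(observations so far, current state = i)
    alpha = [init[i] * emit[i][obs[0]] for i in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][o]
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state example: state 0 = "resting", state 1 = "foraging";
# observation 0 = short movement step, observation 1 = long movement step.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.8, 0.2], [0.3, 0.7]]
likelihood = forward_likelihood(init, trans, emit, [0, 0, 1, 1])
```

The recursion sums over all possible hidden state paths without enumerating them, so the cost grows linearly with the length of the observation sequence.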
Affiliation(s)
- Roland Langrock
- Department of Business Administration and Economics, Bielefeld University, Bielefeld, Germany
- Olivier Gimenez
- CNRS Centre d'Ecologie Fonctionnelle et Evolutive, Montpellier, France
- Emmanuelle Cam
- Laboratoire des Sciences de l'Environnement Marin, Institut Universitaire Européen de la Mer, Univ. Brest, CNRS, IRD, Ifremer, France
- David L. Borchers
- School of Mathematics and Statistics, University of St Andrews, St Andrews, UK
- Richard Glennie
- School of Mathematics and Statistics, University of St Andrews, St Andrews, UK
10
Khodaei A, Feizi-Derakhshi MR, Mozaffari-Tazehkand B. A Markov chain-based feature extraction method for classification and identification of cancerous DNA sequences. BioImpacts 2020; 11:87-99. [PMID: 33842279 PMCID: PMC8022238 DOI: 10.34172/bi.2021.16] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/23/2019] [Revised: 01/06/2020] [Accepted: 01/21/2020] [Indexed: 01/06/2023]
Abstract
Introduction: In recent decades, the rising incidence of cancer has become a major concern for most societies. Because cancer has genetic origins, studying the disease requires examining its internal structure. Methods: In this research, cancer data are analyzed based on DNA sequences. The transition probabilities of nucleotide pairs occurring in DNA sequences have the Markov property. This property motivates reducing the feature dimension of DNA sequences to overcome the high computational overhead of gene analysis. The resulting mapping decreases the feature dimension while conserving the basic properties needed to discriminate cancerous from non-cancerous genes. Results: The results showed that a non-linear support vector machine (SVM) classifier with RBF and polynomial kernel functions can discriminate selected cancerous samples from non-cancerous ones. Experimental results based on 10-fold cross-validation and accuracy metrics verified that the proposed method has low computational overhead and high accuracy. Conclusion: The proposed algorithm was successfully tested on related research case studies. In general, a combination of the proposed Markovian feature reduction and a non-linear SVM classifier can be considered one of the best methods for discriminating cancerous from non-cancerous genes.
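The feature mapping described in this abstract can be illustrated with a minimal sketch. It assumes a first-order Markov chain over the four nucleotides and uses the 16 row-normalised transition probabilities as a fixed-length feature vector. The function name `markov_transition_features` and the handling of ambiguous bases are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter

ALPHABET = "ACGT"

def markov_transition_features(seq):
    """Return 16 first-order transition probabilities P(next | current).

    Hypothetical helper: the paper's exact mapping is not given in the
    abstract, so this is only one plausible reading of it.
    """
    seq = seq.upper()
    # Count adjacent nucleotide pairs, skipping pairs with ambiguous bases.
    pair_counts = Counter(
        (a, b) for a, b in zip(seq, seq[1:]) if a in ALPHABET and b in ALPHABET
    )
    features = []
    for a in ALPHABET:
        row_total = sum(pair_counts[(a, b)] for b in ALPHABET)
        for b in ALPHABET:
            # Rows with no observed transitions are left as zeros.
            features.append(pair_counts[(a, b)] / row_total if row_total else 0.0)
    return features
```

Whatever the sequence length, the output has 16 values, which is the kind of dimension reduction the abstract refers to. Such a vector could then be passed to any standard classifier.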
Affiliation(s)
- Amin Khodaei
- Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran
|
11
|
Li Z, Guan Y, Yuan X, Zheng P, Zhu H. Prediction of Sphingosine protein-coding regions with a self adaptive spectral rotation method. PLoS One 2019; 14:e0214442. [PMID: 30943219 PMCID: PMC6447165 DOI: 10.1371/journal.pone.0214442] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/08/2019] [Accepted: 03/13/2019] [Indexed: 01/08/2023] [Open Access]
Abstract
Identifying protein coding regions in DNA sequences by computational methods is an active research topic. Welan gum produced by Sphingomonas sp. WG has great application potential in oil recovery and concrete construction industry. Predicting the coding regions in the Sphingomonas sp. WG genome and addressing the mechanism underlying the explanation for the synthesis of Welan gum metabolism is an important issue at present. In this study, we apply a self adaptive spectral rotation (SASR, for short) method, which is based on the investigation of the Triplet Periodicity property, to predict the coding regions of the whole-genome data of Sphingomonas sp. WG without any previous training process, and 1115 suspected gene fragments are obtained. Suspected gene fragments are subjected to a similarity search against the non-redundant protein sequences (nr) database of NCBI with blastx, and 762 suspected gene fragments have been labeled as genes in the nr database.
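The triplet periodicity property mentioned in this abstract can be illustrated with a minimal sketch. Note that this is the classical fixed-frequency spectral measure (Fourier power at period 3 of the base indicator sequences), not the SASR method itself, whose details the abstract does not give. The function name `period3_power` and the length normalisation are assumptions.

```python
import cmath

def period3_power(seq):
    """Length-normalised spectral power at period 3, summed over A, C, G, T.

    Illustrative only: a classical triplet-periodicity score, not SASR.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # Fourier coefficient at frequency 1/3 of the 0/1 indicator of `base`.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, c in enumerate(seq)
            if c == base
        )
        total += abs(coeff) ** 2
    return total / n
```

Coding regions tend to score higher than non-coding ones under this measure because codon usage biases which base appears at each of the three codon positions. A perfectly periodic toy sequence such as repeated `ATG` scores high, while a homopolymer scores near zero.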
Affiliation(s)
- Zhongwei Li
- College of Computer and Communication Engineering, China University of Petroleum, Qingdao, Shandong, China
- Yanan Guan
- College of Computer and Communication Engineering, China University of Petroleum, Qingdao, Shandong, China
- Xiang Yuan
- College of Computer and Communication Engineering, China University of Petroleum, Qingdao, Shandong, China
- Pan Zheng
- Department of Accounting and Information Systems, University of Canterbury, Christchurch, New Zealand
- Hu Zhu
- College of Chemistry and Materials, Fujian Normal University, Fuzhou, China
|
12
|
Meher PK, Sahu TK, Gahoi S, Tomar R, Rao AR. funbarRF: DNA barcode-based fungal species prediction using multiclass Random Forest supervised learning model. BMC Genet 2019; 20:2. [PMID: 30616524 PMCID: PMC6323839 DOI: 10.1186/s12863-018-0710-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 03/13/2018] [Accepted: 12/26/2018] [Indexed: 11/10/2022] [Open Access]
Abstract
BACKGROUND Identification of unknown fungal species aids the conservation of fungal diversity. As many fungal species cannot be cultured, morphological identification of those species is almost impossible. However, DNA barcoding can be employed to identify such species. For fungal taxonomy prediction, the ITS (internal transcribed spacer) region of rDNA (ribosomal DNA) is used as the barcode. Although computational prediction of fungal species has become feasible with the availability of a huge volume of barcode sequences in the public domain, prediction remains challenging due to the high degree of variability among ITS regions within species. RESULTS A Random Forest (RF)-based predictor was built for identification of unknown fungal species. The reference and query sequences were mapped onto numeric features based on gapped base pair compositions, and then used as training and test sets, respectively, for prediction of fungal species using RF. More than 85% accuracy was found when 4 sequences per species in the reference set were utilized, whereas accuracy stabilized at ~88% when ≥7 sequences per species in the reference set were used to train the model. The proposed model achieved comparable accuracy when evaluated against existing methods through a cross-validation procedure. The proposed model also outperformed several existing models used for identification of species other than fungi. CONCLUSIONS An online prediction server "funbarRF" is established at http://cabgrid.res.in:8080/funbarrf/ for fungal species identification. In addition, an R package funbarRF ( https://cran.r-project.org/web/packages/funbarRF/ ) is available for prediction using high-throughput sequence data. The effort put into this work will supplement future endeavors in fungal taxonomy assignment based on DNA barcodes.
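The "gapped base pair composition" features named in this abstract can be sketched as follows. The sketch assumes that, for each gap size g, the feature is the relative frequency of each ordered base pair whose members are separated by g intervening positions. The function name `gapped_pair_composition`, the `max_gap` parameter and the normalisation are assumptions, and the funbarRF package may define these features differently.

```python
from collections import Counter

def gapped_pair_composition(seq, max_gap=2):
    """Relative frequencies of base pairs separated by 0..max_gap positions.

    Hypothetical illustration; returns 16 * (max_gap + 1) values.
    """
    seq = seq.upper()
    bases = "ACGT"
    features = []
    for gap in range(max_gap + 1):
        step = gap + 1  # distance between the two bases of a pair
        counts = Counter(
            a + b for a, b in zip(seq, seq[step:]) if a in bases and b in bases
        )
        total = sum(counts.values())
        for a in bases:
            for b in bases:
                features.append(counts[a + b] / total if total else 0.0)
    return features
```

Every sequence maps to a vector of the same length regardless of how long the barcode is, which is what makes it usable as Random Forest input.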
Affiliation(s)
- Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
- Tanmaya Kumar Sahu
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
- Shachi Gahoi
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
- Ruchi Tomar
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
- Department of Bioinformatics, Janta Vedic College, Baraut, Baghpat, Uttar Pradesh 250611 India
- Atmakuri Ramakrishna Rao
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
|
13
|
Abstract
Gene prediction, also known as gene identification, gene finding, gene recognition, or gene discovery, is one of the important problems of molecular biology and is receiving increasing attention due to the advent of large-scale genome sequencing projects. We designed an ab initio model (called ChemGenome) for gene prediction in prokaryotic genomes based on physicochemical characteristics of codons. In this chapter, we present the methodology of the latest version of this model, ChemGenome2.1 (CG2.1). The first module of the protocol builds a three-dimensional vector from three calculated quantities for each codon: the double-helical trinucleotide base pairing energy, the base pair stacking energy, and an index of the propensity of a codon for protein-nucleic acid interactions. As this three-dimensional vector moves along any genome, the net orientation of the resultant vector should differ significantly for genic and non-genic regions to make a distinction feasible. The putative protein-coding genes predicted from the above parameters are passed through a second module of the protocol, which reduces the number of false positives by utilizing a filter based on stereochemical properties of protein sequences. The chemical properties of amino acid side chains taken into consideration are the presence of an sp3-hybridized γ carbon atom, hydrogen bond donor ability, short/absence of δ carbon and linearity of the side chains/non-occurrence of bi-dentate forks with terminal hydrogen atoms in the side chain. The final prediction of the potential protein-coding genes is based on the frequency of occurrence of amino acids in the predicted protein sequences and their deviation from the frequency values of Swissprot protein sequences, at both monomer and tripeptide levels. The final screening is based on the Z-score.
Though CG2.1 is a gene finding tool for prokaryotes, considering the underlying similarity in the chemical and physical properties of DNA among prokaryotes and eukaryotes, we attempted to evaluate its applicability for gene finding in the lower eukaryotes. The results give a hope that the concept of gene finding based on physicochemical model of codons is a viable idea for eukaryotes as well, though, undoubtedly, improvements are needed.
Affiliation(s)
- Akhilesh Mishra
- Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India
- Kusuma School of Biological Sciences, Indian Institute of Technology Delhi, New Delhi, India
- Priyanka Siwach
- Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India
- Department of Biotechnology, Chaudhary Devi Lal University, Sirsa, Haryana, India
- Poonam Singhal
- Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India
- B Jayaram
- Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India
- Kusuma School of Biological Sciences, Indian Institute of Technology Delhi, New Delhi, India
- Department of Chemistry, Indian Institute of Technology Delhi, New Delhi, India
|
14
|
Meher PK, Sahu TK, Mohanty J, Gahoi S, Purru S, Grover M, Rao AR. nifPred: Proteome-Wide Identification and Categorization of Nitrogen-Fixation Proteins of Diaztrophs Based on Composition-Transition-Distribution Features Using Support Vector Machine. Front Microbiol 2018; 9:1100. [PMID: 29896173 PMCID: PMC5986947 DOI: 10.3389/fmicb.2018.01100] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 12/04/2017] [Accepted: 05/08/2018] [Indexed: 11/13/2022] [Open Access]
Abstract
As inorganic nitrogen compounds are essential for the basic building blocks of life (e.g., nucleotides and amino acids), the role of biological nitrogen-fixation (BNF) is indispensable. All nitrogen-fixing microbes rely on the same nitrogenase enzyme for nitrogen reduction, which is in fact an enzyme complex consisting of as many as 20 genes. However, the occurrence of six genes, viz., nifB, nifD, nifE, nifH, nifK, and nifN, has been proposed to be essential for a functional nitrogenase enzyme. Therefore, identification of these genes is important to understand the mechanism of BNF as well as to explore the possibilities for improving BNF from an agricultural sustainability point of view. Further, though computational tools are available for the annotation and phylogenetic analysis of nifH gene sequences alone, to the best of our knowledge no tool is available for the computational prediction of the above-mentioned six categories of nitrogen-fixation (nif) genes or proteins. Thus, we proposed an approach, which is the first of its kind, for the computational identification of nif proteins encoded by the six categories of nif genes. Sequence-derived features were employed to map the input sequences into vectors of numeric observations that were subsequently fed to the support vector machine as input. Two types of classifier were constructed: (i) a binary classifier for classification of nif and non-nitrogen-fixation (non-nif) proteins, and (ii) a multi-class classifier for classification of six categories of nif proteins. Higher accuracies were observed for the combination of the composition-transition-distribution (CTD) feature set and radial kernel, as compared to the other feature-kernel combinations. Overall accuracies of >90% were observed in both binary and multi-class classifications. The developed approach further achieved >92% accuracy when evaluated with blind (independent) test datasets.
The developed approach also produced higher accuracy in identifying nif proteins, while evaluated using proteome-wide datasets of several species. Furthermore, we established a prediction server nifPred (http://webapp.cabgrid.res.in/nifPred) to assist the scientific community for proteome-wide identification of six categories of nif proteins. Besides, the source code of nifPred is also available at https://github.com/PrabinaMeher/nifPred. The developed web server is expected to supplement the transcriptional profiling and comparative genomics studies for the identification and functional annotation of genes related to BNF.
Affiliation(s)
- Prabina K Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Tanmaya K Sahu
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Jyotilipsa Mohanty
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Department of Bioinformatics, Orissa University of Agriculture and Technology, Bhubaneswar, India
- Shachi Gahoi
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Supriya Purru
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Monendra Grover
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Atmakuri R Rao
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
|
15
|
Meher PK, Sahu TK, Banchariya A, Rao AR. DIRProt: a computational approach for discriminating insecticide resistant proteins from non-resistant proteins. BMC Bioinformatics 2017; 18:190. [PMID: 28340571 PMCID: PMC5364559 DOI: 10.1186/s12859-017-1587-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 11/17/2016] [Accepted: 03/09/2017] [Indexed: 02/06/2023] [Open Access]
Abstract
BACKGROUND Insecticide resistance is a major challenge for insect pest control programs in the fields of crop protection, human and animal health, etc. Resistance to different insecticides is conferred by the proteins encoded by certain classes of insect genes. To date, no computational tool is available to distinguish insecticide resistant proteins from non-resistant proteins. Thus, development of such a computational tool will be helpful in predicting the insecticide resistant proteins, which can be targeted for developing appropriate insecticides. RESULTS Five different feature sets, viz., amino acid composition (AAC), di-peptide composition (DPC), pseudo amino acid composition (PAAC), composition-transition-distribution (CTD) and auto-correlation function (ACF), were used to map the protein sequences into numeric feature vectors. The encoded numeric vectors were then used as input to a support vector machine (SVM) for classification of insecticide resistant and non-resistant proteins. Higher accuracies were obtained with the RBF kernel than with other kernels. Further, accuracies were observed to be higher for the DPC feature set than for the others. The proposed approach achieved an overall accuracy of >90% in discriminating resistant from non-resistant proteins. Further, the two classes of resistant proteins, i.e., detoxification-based and target-based, were discriminated from non-resistant proteins with >95% accuracy. Besides, >95% accuracy was also observed for discrimination of proteins involved in detoxification- and target-based resistance mechanisms. The proposed approach not only outperformed the Blastp, PSI-Blast and Delta-Blast algorithms, but also achieved >92% accuracy when assessed using an independent dataset of 75 insecticide resistant proteins. CONCLUSIONS This paper presents the first computational approach for discriminating insecticide resistant proteins from non-resistant proteins.
Based on the proposed approach, an online prediction server DIRProt has also been developed for computational prediction of insecticide resistant proteins, which is accessible at http://cabgrid.res.in:8080/dirprot/ . The proposed approach is believed to supplement the efforts needed to develop dynamic insecticides in wet-lab by targeting the insecticide resistant proteins.
Affiliation(s)
- Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012, India
- Tanmaya Kumar Sahu
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012, India
- Anjali Banchariya
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012, India
- Department of Bioinformatics, Janta Vedic College, Baraut, Baghpat, 250611, Uttar Pradesh, India
- Atmakuri Ramakrishna Rao
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012, India
|
16
|
Meher PK, Sahu TK, Saini V, Rao AR. Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou's general PseAAC. Sci Rep 2017; 7:42362. [PMID: 28205576 PMCID: PMC5304217 DOI: 10.1038/srep42362] [Citation(s) in RCA: 274] [Impact Index Per Article: 39.1] [Received: 10/24/2016] [Accepted: 01/09/2017] [Indexed: 11/13/2022] [Open Access]
Abstract
Antimicrobial peptides (AMPs) are important components of the innate immune system that have been found to be effective against disease-causing pathogens. Identification of AMPs through wet-lab experiments is expensive. Therefore, development of an efficient computational tool is essential to identify the best candidate AMPs prior to in vitro experimentation. In this study, we made an attempt to develop a support vector machine (SVM)-based computational approach for prediction of AMPs with improved accuracy. Initially, compositional, physico-chemical and structural features of the peptides were generated and subsequently used as input to the SVM for prediction of AMPs. The proposed approach achieved higher accuracy than several existing approaches when compared on a benchmark dataset. Based on the proposed approach, an online prediction server, iAMPpred, has also been developed to help the scientific community in predicting AMPs, and is freely accessible at http://cabgrid.res.in:8080/amppred/. The proposed approach is believed to supplement the tools and techniques that have been developed in the past for prediction of AMPs.
Affiliation(s)
- Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi-110012, India
- Tanmaya Kumar Sahu
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi-110012, India
- Varsha Saini
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi-110012, India
- Department of Bioinformatics, Janta Vedic College, Baraut, Baghpat-250611, Uttar Pradesh, India
- Atmakuri Ramakrishna Rao
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi-110012, India
|
17
|
Al Bataineh M, Al-qudah Z. A novel gene identification algorithm with Bayesian classification. Biomed Signal Process Control 2017. [DOI: 10.1016/j.bspc.2016.07.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/26/2022]
|
18
|
Meher PK, Sahu TK, Rao AR, Wahi SD. A computational approach for prediction of donor splice sites with improved accuracy. J Theor Biol 2016; 404:285-294. [PMID: 27302911 DOI: 10.1016/j.jtbi.2016.06.013] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 11/18/2015] [Revised: 04/18/2016] [Accepted: 06/09/2016] [Indexed: 11/24/2022]
Abstract
Identification of splice sites is important due to their key role in predicting the exon-intron structure of protein-coding genes. Though several approaches have been developed for the prediction of splice sites, further improvement in prediction accuracy will help predict gene structure more accurately. This paper presents a computational approach for prediction of donor splice sites with higher accuracy. In this approach, true and false splice sites were first encoded into numeric vectors and then used as input to an artificial neural network (ANN), a support vector machine (SVM) and a random forest (RF) for prediction. ANN and SVM were found to perform equally well, and better than RF, when tested on the HS3D and NN269 datasets. Further, the performance of ANN, SVM and RF was analyzed using an independent test set of 50 genes, and the prediction accuracy of ANN was found to be higher than that of SVM and RF. All the predictors achieved higher accuracy than existing methods such as NNsplice, MEM, MDD, WMM, MM1, FSPLICE, GeneID and ASSP on the independent test set. We have also developed an online prediction server (PreDOSS), available at http://cabgrid.res.in:8080/predoss, for prediction of donor splice sites using the proposed approach.
Affiliation(s)
- Prabina Kumar Meher
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
- Tanmaya Kumar Sahu
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
- A R Rao
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
- S D Wahi
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
|
19
|
Meher PK, Sahu TK, Rao AR, Wahi SD. Identification of donor splice sites using support vector machine: a computational approach based on positional, compositional and dependency features. Algorithms Mol Biol 2016; 11:16. [PMID: 27252772 PMCID: PMC4888255 DOI: 10.1186/s13015-016-0078-4] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 11/28/2015] [Accepted: 05/17/2016] [Indexed: 11/16/2022] [Open Access]
Abstract
Background Identification of splice sites is essential for annotation of genes. Though existing approaches have achieved an acceptable level of accuracy, there is still a need for further improvement. Besides, most of the approaches are species-specific, so approaches that work across species are required. Results Each splice site sequence was transformed into a numeric vector of length 49, of which four were positional, four were dependency and 41 were compositional features. Using the transformed vectors as input, prediction was made with a support vector machine. Using a balanced training set, the proposed approach achieved area under the ROC curve (AUC-ROC) of 96.05, 96.96, 96.95, 96.24 % and area under the PR curve (AUC-PR) of 97.64, 97.89, 97.91, 97.90 % when tested on human, cattle, fish and worm datasets, respectively. On the other hand, AUC-ROC of 97.21, 97.45, 97.41, 98.06 % and AUC-PR of 93.24, 93.34, 93.38, 92.29 % were obtained when imbalanced training datasets were used. The proposed approach was found to be comparable with state-of-the-art splice site prediction approaches when compared using the benchmark NN269 dataset and other datasets. Conclusions The proposed approach achieved consistent accuracy across different species and was found to be comparable with the existing approaches. Thus, we believe that the proposed approach can be used as a complementary method to the existing methods for the prediction of splice sites. A web server named ‘HSplice’ has also been developed based on the proposed approach for easy prediction of 5′ splice sites by users and is freely available at http://cabgrid.res.in:8080/HSplice.
|
20
|
A Comprehensive Review of Emerging Computational Methods for Gene Identification. Journal of Information Processing Systems 2016. [DOI: 10.3745/jips.04.0023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/03/2022]
|
21
|
|
22
|
El Yazid Boudaren M, Monfrini E, Pieczynski W, Aïssani A. Phasic Triplet Markov Chains. IEEE Transactions on Pattern Analysis and Machine Intelligence 2014; 36:2310-2316. [PMID: 26353069 DOI: 10.1109/tpami.2014.2327974] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
Hidden Markov chains have been shown to be inadequate for data modeling under some complex conditions. In this work, we address the problem of statistical modeling of phenomena involving two heterogeneous system states. Such phenomena may arise in biology or communications, among other fields. Namely, we consider that a sequence of meaningful words is to be searched within a whole observation that also contains arbitrary one-by-one symbols. Moreover, a word may be interrupted at some site to be carried on later. Applying plain hidden Markov chains to such data, while ignoring their specificity, yields unsatisfactory results. The Phasic triplet Markov chain, proposed in this paper, overcomes this difficulty by means of an auxiliary underlying process in accordance with the triplet Markov chains theory. Related Bayesian restoration techniques and parameters estimation procedures according to the new model are then described. Finally, to assess the performance of the proposed model against the conventional hidden Markov chain model, experiments are conducted on synthetic and real data.
|
23
|
Regional effects on chimera formation in 454 pyrosequenced amplicons from a mock community. J Microbiol 2014; 52:566-73. [DOI: 10.1007/s12275-014-3485-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 09/23/2013] [Revised: 02/04/2014] [Accepted: 03/12/2014] [Indexed: 11/27/2022]
|
24
|
Molina J, Hazzouri KM, Nickrent D, Geisler M, Meyer RS, Pentony MM, Flowers JM, Pelser P, Barcelona J, Inovejas SA, Uy I, Yuan W, Wilkins O, Michel CI, LockLear S, Concepcion GP, Purugganan MD. Possible loss of the chloroplast genome in the parasitic flowering plant Rafflesia lagascae (Rafflesiaceae). Mol Biol Evol 2014; 31:793-803. [PMID: 24458431 PMCID: PMC3969568 DOI: 10.1093/molbev/msu051] [Citation(s) in RCA: 125] [Impact Index Per Article: 12.5] [Indexed: 12/13/2022] [Open Access]
Abstract
Rafflesia is a genus of holoparasitic plants endemic to Southeast Asia that has lost the ability to undertake photosynthesis. With short-read sequencing technology, we assembled a draft sequence of the mitochondrial genome of Rafflesia lagascae Blanco, a species endemic to the Philippine island of Luzon, with ∼350× sequencing depth coverage. Using multiple approaches, however, we were only able to identify small fragments of plastid sequences at low coverage depth (<2×) and could not recover any substantial portion of a chloroplast genome. The gene fragments we identified included photosynthesis and energy production genes (atp, ndh, pet, psa, psb, rbcL), ribosomal RNA genes (rrn16, rrn23), ribosomal protein genes (rps7, rps11, rps16), transfer RNA genes, as well as matK, accD, ycf2, and multiple nongenic regions from the inverted repeats. None of the identified plastid gene sequences had intact reading frames. Phylogenetic analysis suggests that ∼33% of these remnant plastid genes may have been horizontally transferred from the host plant genus Tetrastigma with the rest having ambiguous phylogenetic positions (<50% bootstrap support), except for psaB that was strongly allied with the plastid homolog in Nicotiana. Our inability to identify substantial plastid genome sequences from R. lagascae using multiple approaches—despite success in identifying and developing a draft assembly of the much larger mitochondrial genome—suggests that the parasitic plant genus Rafflesia may be the first plant group for which there is no recognizable plastid genome, or if present is found in cryptic form at very low levels.
Affiliation(s)
- Jeanmaire Molina
- Department of Biology, Long Island University, Brooklyn
- Center for Genomics and Systems Biology, New York University
- Khaled M. Hazzouri
- Center for Genomics and Systems Biology, NYU Abu Dhabi, Abu Dhabi, United Arab Emirates
- Daniel Nickrent
- Department of Plant Biology, Southern Illinois University, Carbondale
- Matthew Geisler
- Department of Plant Biology, Southern Illinois University, Carbondale
- Rachel S. Meyer
- Center for Genomics and Systems Biology, New York University
- Melissa M. Pentony
- Computational Genomics Core, Department of Genetics, Albert Einstein College of Medicine, Bronx, New York
- Jonathan M. Flowers
- Center for Genomics and Systems Biology, New York University
- Center for Genomics and Systems Biology, NYU Abu Dhabi, Abu Dhabi, United Arab Emirates
- Pieter Pelser
- School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
- Julie Barcelona
- School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
- Samuel Alan Inovejas
- Electron Microscope Facility, St. Luke’s Medical Center, Quezon City, Philippines
- Iris Uy
- Philippine Genome Center, University of the Philippines, Diliman, Quezon City, Philippines
- Wei Yuan
- Center for Genomics and Systems Biology, New York University
- Olivia Wilkins
- Center for Genomics and Systems Biology, New York University
- Gisela P. Concepcion
- Philippine Genome Center, University of the Philippines, Diliman, Quezon City, Philippines
- Michael D. Purugganan
- Center for Genomics and Systems Biology, New York University
- Center for Genomics and Systems Biology, NYU Abu Dhabi, Abu Dhabi, United Arab Emirates
|
25
|
Won KJ, Zhang X, Wang T, Ding B, Raha D, Snyder M, Ren B, Wang W. Comparative annotation of functional regions in the human genome using epigenomic data. Nucleic Acids Res 2013; 41:4423-32. [PMID: 23482391 PMCID: PMC3632130 DOI: 10.1093/nar/gkt143] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Indexed: 12/17/2022] [Open Access]
Abstract
Epigenetic regulation is dynamic and cell-type dependent. The recently available epigenomic data in multiple cell types provide an unprecedented opportunity for a comparative study of the epigenetic landscape. We developed a machine-learning method called ChroModule to annotate the epigenetic states in eight ENCyclopedia Of DNA Elements cell types. The trained model successfully captured the characteristic histone-modification patterns associated with regulatory elements, such as promoters and enhancers, and showed superior performance in identifying enhancers compared with state-of-the-art methods. In addition, given the fixed number of epigenetic states in the model, ChroModule allows straightforward illustration of epigenetic variability in multiple cell types. Using this feature, we found that invariable and variable epigenetic states across cell types correspond to housekeeping functions and stimulus response, respectively. In particular, we observed that enhancers, but not the other regulatory elements, dictate cell specificity, as similar cell types share common enhancers, and cell-type-specific enhancers are often bound by transcription factors playing critical roles in that cell type. More interestingly, we found that some genomic regions are dormant in one cell type but primed to become active in other cell types. These observations highlight the usefulness of ChroModule in comparative analysis and interpretation of multiple epigenomes.
Affiliation(s)
- Kyoung-Jae Won
- Department of Chemistry and Biochemistry, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0359, USA
|
26
|
Detecting the borders between coding and non-coding DNA regions in prokaryotes based on recursive segmentation and nucleotide doublets statistics. BMC Genomics 2012; 13 Suppl 8:S19. [PMID: 23282225 PMCID: PMC3535712 DOI: 10.1186/1471-2164-13-s8-s19] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Detecting the borders between coding and non-coding regions is an essential step in genome annotation. Information entropy measures are useful for describing the signals in a genome sequence. However, the accuracy of previous border-finding methods based on entropy segmentation still needs to be improved. METHODS In this study, we first applied a new recursive entropic segmentation method to DNA sequences to obtain preliminary significant cuts. A 22-symbol alphabet is used to capture the differential composition of nucleotide doublets and stop codon patterns along three phases in both DNA strands. This process requires no prior training datasets. RESULTS Compared with previous segmentation methods, the experimental results on three bacterial genomes, Rickettsia prowazekii, Borrelia burgdorferi and E. coli, show that our approach improves the accuracy of finding the borders between coding and non-coding regions in DNA sequences. CONCLUSIONS This paper presents a new segmentation method for prokaryotes based on Jensen-Rényi divergence with a 22-symbol alphabet. For three bacterial genomes, our method raised the accuracy of finding the borders between protein-coding and non-coding regions in DNA sequences compared with the A12_JR method.
|
27
|
Bonneville R, Jin VX. A hidden Markov model to identify combinatorial epigenetic regulation patterns for estrogen receptor α target genes. Bioinformatics 2012; 29:22-8. [PMID: 23104890 DOI: 10.1093/bioinformatics/bts639] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 12/15/2022]
Abstract
MOTIVATION Many studies have shown that epigenetic changes, such as altered DNA methylation and histone modifications, are linked to estrogen receptor α (ERα)-positive tumors and disease prognoses. Several recent studies have applied high-throughput technologies such as ChIP-seq and MBD-seq to interrogate the altered architectures of ERα regulation in tamoxifen (Tam)-resistant breast cancer cells. However, the details of combinatorial epigenetic regulation of ERα target genes in breast cancers with acquired Tam resistance have not yet been fully examined. RESULTS We developed a computational approach to identify and analyze epigenetic patterns associated with Tam resistance in the MCF7-T cell line as opposed to the Tam-sensitive MCF7 cell line, with the goal of understanding the underlying mechanisms of epigenetic regulatory influence on resistance to Tam treatment in breast cancer. In this study, we used ChIP-seq of ERα, RNA polymerase II, three histone modifications and MBD-seq data of DNA methylation in MCF7 and MCF7-T cells to train hidden Markov models (HMMs). We applied the Bayesian information criterion to determine that a 20-state HMM was best, which was reduced to a 14-state HMM with a Bayesian information criterion score of 1.21291 × 10^7. We further identified four classes of biologically meaningful states in this breast cancer cell model system, and a set of ERα combinatorial epigenetic regulated target genes. The correlated gene expression level and gene ontology analyses showed that different gene ontology terms were enriched with Tam-resistant versus sensitive breast cancer cells. Our study illustrates the applicability of HMM-based analysis of genome-wide high-throughput genomic data to study epigenetic influences on E2/ERα regulation in breast cancer.
Affiliation(s)
- Russell Bonneville
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
|
28
|
Zhang L, Tian F, Wang S. A modified statistically optimal null filter method for recognizing protein-coding regions. Genomics, Proteomics & Bioinformatics 2012; 10:166-73. [PMID: 22917190 PMCID: PMC5054498 DOI: 10.1016/j.gpb.2012.02.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2011] [Revised: 02/04/2012] [Accepted: 02/21/2012] [Indexed: 11/21/2022]
Abstract
Computer-aided protein-coding gene prediction in uncharacterized genomic DNA sequences is one of the most important issues in biological signal processing. A modified filter method based on statistically optimal null filter (SONF) theory is proposed for recognizing protein-coding regions. The square deviation gain (SDG) between the input and output of the model is used to identify the coding regions. The effective SDG amplification model with Class I and Class II enhancement is designed to suppress the non-coding regions. Also, an evaluation algorithm has been used to compare the modified model with most gene prediction methods currently available in terms of sensitivity, specificity and precision. The performance for identification of protein-coding regions has been evaluated at the nucleotide level using benchmark datasets, and 91.4%, 96% and 93.7% were obtained for sensitivity, specificity and precision, respectively. These results suggest that the proposed model is potentially useful in the gene-finding field, where it can help recognize protein-coding regions with higher precision and speed than present algorithms.
Affiliation(s)
- Lei Zhang
- College of Communication Engineering, Chongqing University, Chongqing 400044, China.
|
29
|
Abstract
MOTIVATION Probabilistic logic programming offers a powerful way to describe and evaluate structured statistical models. To investigate the practicality of probabilistic logic programming for structure learning in bioinformatics, we undertook a simplified bacterial gene-finding benchmark in PRISM, a probabilistic dialect of Prolog. RESULTS We evaluate Hidden Markov Model structures for bacterial protein-coding gene potential, including a simple null model structure, three structures based on existing bacterial gene finders and two novel model structures. We test standard versions as well as ADPH length modeling and three-state versions of the five model structures. The models are all represented as probabilistic logic programs and evaluated using the PRISM machine learning system in terms of statistical information criteria and gene-finding prediction accuracy, in two bacterial genomes. Neither of our implementations of the two currently most used model structures is the best performing in terms of statistical information criteria or prediction performance, suggesting that better-fitting models might be achievable. AVAILABILITY The source code of all PRISM models, data and additional scripts are freely available for download at: http://github.com/somork/codonhmm. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Søren Mørk
- Department of Science, Systems and Models, Roskilde University, 4000 Roskilde, Denmark.
|
30
|
Abstract
Evolutionary genomics is a field that relies heavily upon comparing genomes, that is, the full complement of genes of one species with another. However, given a genome sequence and little else, as is now often the case, genes must first be found and annotated before downstream analyses can be done. Computational gene prediction techniques are brought to bear on the problem of constructing a genome annotation as manual annotation is extremely time-consuming and costly. This chapter reviews the methods by which the individual components of a typical gene structure are detected in genomic sequence and then discusses several popular statistical frameworks for integrated gene prediction on eukaryotic genome sequences.
Affiliation(s)
- Tyler Alioto
- Centro Nacional de Análisis Genómico, Barcelona, Spain.
|
31
|
Chen B, Ji P. Numericalization of the self adaptive spectral rotation method for coding region prediction. J Theor Biol 2011; 296:95-102. [PMID: 22178641 DOI: 10.1016/j.jtbi.2011.12.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 04/26/2011] [Revised: 10/24/2011] [Accepted: 12/01/2011] [Indexed: 11/27/2022]
Abstract
Recently, a Self Adaptive Spectral Rotation (SASR) method has been developed for identifying protein-coding regions in new sequences from unknown organisms without training sets. It visualizes the Triplet Periodicity (TP) property, which is a simple and universal coding-related property. The rough locations of coding regions can be visually revealed by the SASR method, without any training. However, the method does not numerically discriminate the locations of coding regions. Based on the SASR method, we develop a new approach, named the T-Z-T analysis, to provide numerical results of coding region prediction. This approach adopts a t-test segmentation to separate coding and non-coding regions in the SASR's output and further uses a z-test filter to recognize region patterns. After that, another t-test segmentation is conducted to break down adjacent coding regions by detecting the frame shifts. Since it is based on the graphic output of the SASR, this approach does not require any training. Meanwhile, this approach is more stable, because it is not sensitive to errors in the input DNA sequence. Such advantages make it suitable for coding region prediction at an early stage, when training sets are insufficient and the input data may even be inaccurate.
Affiliation(s)
- Bo Chen
- College of Mathematics and Computer Science, Fuzhou University, China.
|
32
|
Suvorova YM, Rudenko VM, Korotkov EV. Detection change points of triplet periodicity of gene. Gene 2011; 491:58-64. [PMID: 21982972 DOI: 10.1016/j.gene.2011.08.032] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 03/19/2011] [Revised: 08/10/2011] [Accepted: 08/25/2011] [Indexed: 10/17/2022]
Abstract
Triplet periodicity (TP) is a distinguishing property of protein-coding sequences. There are complex genes with more than one TP type along their sequence. We say that these genes contain a triplet periodicity change point. The aim of this work is to find all genes that contain a TP change point and to compare the positions of the change points in genes with known biological data. We have developed a mathematical method to identify triplet periodicity changes along a sequence. We have found 311,221 genes with a TP change point in the KEGG/Genes database (version 48). This is about 8% of the total database volume (4,013,150). We showed that repetitive sequences are not the only cause of such events. We suppose that a TP change point may indicate a fusion of genes or domains. We performed BLAST analysis to find potential ancestral genes for the parts of genes with a TP change point. As a result, we found that in 131,323 cases sequences with a TP change point have proper similarities for one or both parts. The relationship between TP change points and fusion events in genes is discussed. An implementation of the method is available from the authors on request.
Affiliation(s)
- Yulia M Suvorova
- Bioinformatics Laboratory, Centre of Bioengineering, Russian Academy of Sciences, 117312, Moscow, Prospect 60-tya Oktyabrya, 7/1, Russia.
|
33
|
Sahu SS, Panda G. Identification of protein-coding regions in DNA sequences using a time-frequency filtering approach. Genomics, Proteomics & Bioinformatics 2011; 9:45-55. [PMID: 21641562 PMCID: PMC5054166 DOI: 10.1016/s1672-0229(11)60007-7] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 07/19/2010] [Accepted: 10/31/2010] [Indexed: 11/13/2022]
Abstract
Accurate identification of protein-coding regions (exons) in DNA sequences has been a challenging task in bioinformatics. In particular, the coding regions have a 3-base periodicity, which forms the basis of all exon identification methods. Many signal processing tools and techniques have been applied successfully to the identification task, but further improvement is still needed. In this paper, we introduce a promising new model-independent time-frequency filtering technique based on the S-transform for accurate identification of the coding regions. The S-transform is a powerful linear time-frequency representation useful for filtering in the time-frequency domain. The potential of the proposed technique has been assessed through a simulation study, and the results obtained have been compared with existing methods using standard datasets. The comparative study demonstrates that the proposed method outperforms its counterparts in identifying the coding regions.
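The 3-base periodicity mentioned in this abstract can be illustrated with a plain discrete Fourier measure. The sketch below is not the S-transform method of the paper; it only sums, over the four bases, the squared magnitude of each base-indicator sequence's Fourier component at a period of three bases. The function name `period3_power` and the toy sequences are hypothetical.

```python
import numpy as np

def period3_power(seq: str) -> float:
    """Spectral power at period 3, summed over A, C, G, T and divided by length.

    Illustrative measure only (assumption): a plain DFT component at
    frequency 1/3 cycles per base, not the S-transform of the abstract.
    """
    seq = seq.upper()
    n = len(seq)
    positions = np.arange(n)
    # Complex exponential oscillating once every three bases.
    kernel = np.exp(-2j * np.pi * positions / 3)
    power = 0.0
    for base in "ACGT":
        # Binary indicator: 1 where the sequence carries this base.
        indicator = np.array([1.0 if ch == base else 0.0 for ch in seq])
        power += abs(np.dot(indicator, kernel)) ** 2
    return power / n

# Toy inputs: a strictly codon-periodic repeat versus a homopolymer.
periodic = "GCT" * 100
flat = "A" * 300
print(period3_power(periodic), period3_power(flat))
```

A sequence in which each base prefers one codon position scores high, while a sequence without position-specific bias scores near zero. In practice such a measure would be computed in a sliding window along the genome.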
Affiliation(s)
- Sitanshu Sekhar Sahu
- Department of Electronics and Communication Engineering, National Institute of Technology, Rourkela, India.
|
34
|
Machado-Lima A, Kashiwabara AY, Durham AM. Decreasing the number of false positives in sequence classification. BMC Genomics 2010; 11 Suppl 5:S10. [PMID: 21210966 PMCID: PMC3045793 DOI: 10.1186/1471-2164-11-s5-s10] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022] Open
Abstract
Background A large number of probabilistic models used in sequence analysis assign non-zero probability values to most input sequences. The most common way to decide when a given probability is sufficient is Bayesian binary classification, where the probability of the model characterizing the sequence family of interest is compared to that of an alternative probability model. A null model can be used as the alternative model. This is the scoring technique used by sequence analysis tools such as HMMER, SAM and INFERNAL. The most prevalent null models are position-independent residue distributions that include: the uniform distribution, genomic distribution, family-specific distribution and the target sequence distribution. This paper presents a study to evaluate the impact of the choice of a null model on the final result of classifications. In particular, we are interested in minimizing the number of false predictions in a classification. This is a crucial issue to reduce costs of biological validation. Results For all the tests, the target null model presented the lowest number of false positives, when using random sequences as a test. The study was performed in DNA sequences using GC content as the measure of content bias, but the results should be valid also for protein sequences. To broaden the application of the results, the study was performed using randomly generated sequences. Previous studies were performed on amino acid sequences, using only one probabilistic model (HMM) and on a specific benchmark, and lack more general conclusions about the performance of null models. Finally, a benchmark test with P. falciparum confirmed these results. Conclusions Of the evaluated models the best suited for classification are the uniform model and the target model. However, the use of the uniform model presents a GC bias that can cause more false positives for candidate sequences with extreme compositional bias, a characteristic not described in previous studies. In these cases the target model is more dependable for biological validation due to its higher specificity.
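The effect of the null model on a log-odds score can be sketched with a toy example. The code below assumes a position-independent family model rather than the HMMs used by tools such as HMMER; the GC-rich model probabilities, the pseudocount and the function names are all hypothetical and chosen only to show how a uniform null can reward compositional bias while a target-sequence null does not.

```python
import math
from collections import Counter

def log_odds(seq, model, null):
    """Log2 odds of the sequence under `model` versus `null` (both position-independent)."""
    return sum(math.log2(model[b] / null[b]) for b in seq)

def uniform_null():
    """Null model in which every nucleotide has probability 0.25."""
    return {b: 0.25 for b in "ACGT"}

def target_null(seq, pseudocount=1.0):
    """Null model estimated from the composition of the scored sequence itself.

    The pseudocount (an assumption) keeps unseen bases at non-zero probability.
    """
    counts = Counter(seq)
    total = len(seq) + 4 * pseudocount
    return {b: (counts[b] + pseudocount) / total for b in "ACGT"}

# Hypothetical GC-rich family model and a GC-only candidate sequence.
family_model = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
candidate = "GC" * 50

print("uniform null:", log_odds(candidate, family_model, uniform_null()))
print("target null: ", log_odds(candidate, family_model, target_null(candidate)))
```

With the uniform null the GC-only candidate receives a positive score purely because of its composition; with the target null the same candidate scores below zero, which is the kind of false-positive reduction the abstract reports for compositionally biased sequences.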
Affiliation(s)
- Ariane Machado-Lima
- Escola de Artes, Ciências e Humanidades, Universidade de São Paulo, Rua Arlindo Béttio, 1000, 03828-000, São Paulo, SP, Brazil
|
35
|
Chen B, Ji P. Visualization of the protein-coding regions with a self adaptive spectral rotation approach. Nucleic Acids Res 2010; 39:e3. [PMID: 20947567 PMCID: PMC3017620 DOI: 10.1093/nar/gkq891] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/18/2022] Open
Abstract
Identifying protein-coding regions in DNA sequences is an active issue in computational biology. In this study, we present a self adaptive spectral rotation (SASR) approach, which visualizes coding regions in DNA sequences, based on investigation of the Triplet Periodicity property, without any preceding training process. It is proposed to help with rough prediction of coding regions when there is no extra information for the training required by other outstanding methods. In this approach, at each position in the DNA sequence, a Fourier spectrum is calculated from the posterior subsequence. Following the spectra, a random walk in the complex plane is generated as the SASR's graphic output. Applications of the SASR on real DNA data show that patterns in the graphic output reveal locations of the coding regions and the frame shifts between them: arcs indicate coding regions, stable points indicate non-coding regions and corners’ shapes reveal frame shifts. Tests on a genomic data set from Saccharomyces cerevisiae reveal that the graphic patterns for coding and non-coding regions differ to a great extent, so that the coding regions can be visually distinguished. Meanwhile, a time cost test shows that the SASR can be easily implemented with a computational complexity of O(N).
Affiliation(s)
- Bo Chen
- Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
|
36
|
Zeng J, Alhajj R, Demetrick D. Adaptive multi-agent architecture for functional sequence motifs recognition. Bioinformatics 2009; 25:3084-92. [PMID: 19808882 DOI: 10.1093/bioinformatics/btp567] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Accurate genome annotation or protein function prediction requires precise recognition of functional sequence motifs. Many computational motif prediction models have been proposed. Due to the complexity of the biological data, it may be desirable to apply an integrated approach that uses multiple models for analysis. RESULTS In this article, we propose a novel multi-agent architecture for the general purpose of functional sequence motif recognition. The approach takes advantage of the synergy provided by multiple agents through the employment of different agents equipped with distinctive problem solving skills and promotes collaboration among them through decision maker (DM) agents that work as classifier ensembles. A genetic algorithm-based fusion strategy is applied, which gives the DM agents an evolutionary property. The consistency and robustness of the system are maintained by an evolvable agent that mediates the team of the ensemble agents. The combined effort of a recommendation system (Seer) and the self-learning mediator agent yields a successful identification of the most efficient agent deployment scheme at an early stage of the experimentation process, which has the potential of greatly reducing the computational cost of the system. Two concrete systems are constructed that aim at predicting two important sequence motifs: the translational initiation sites (TISs) and the core promoters. With the incorporation of three distinctive problem solver agents, the TIS predictor consistently outperforms most of the state-of-the-art approaches under investigation. Integrating three existing promoter predictors, our system is able to yield consistently good performance. AVAILABILITY The program (MotifMAS) and the datasets are available upon request.
Affiliation(s)
- Jia Zeng
- Department of Computer Science, University of Calgary, Calgary, AB, Canada.
|
37
|
|
38
|
Frenkel FE, Korotkov EV. Using triplet periodicity of nucleotide sequences for finding potential reading frame shifts in genes. DNA Res 2009; 16:105-14. [PMID: 19261626 PMCID: PMC2671204 DOI: 10.1093/dnares/dsp002] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Indexed: 11/17/2022] Open
Abstract
We introduce a novel approach for the detection of possible mutations leading to a reading frame (RF) shift in a gene. Deletions and insertions in DNA coding regions are significant events for genes because an RF shift modifies an extensive region of the amino acid sequence coded by a gene. The suggested method is based on the phenomenon of triplet periodicity (TP) in coding regions of genes and its relative resistance to substitutions in the DNA sequence. We attempted to extend 326 933 regions of continuous TP found in genes from the KEGG databank by considering possible insertions and deletions. In total, we found 824 genes where such an extension was possible and statistically significant. We then generated amino acid sequences according to the active (KEGG's) and hypothetically ancient RFs in order to find confirmation of a shift at the protein level. Consequently, 64 sequences have protein similarities only for the ancient RF, 176 only for the active RF, 3 for both and 581 have no protein similarity at all. We aimed to establish a lower bound for the number of genes in which a shift between RF and TP is possible. Further ways to increase the number of revealed RF shifts are discussed.
Affiliation(s)
- F E Frenkel
- Bioengineering Centre of RAS, 60-letiya Oktyabrya prosp., 7/1, Moscow, Russia.
|
39
|
Galimov AR, Kruglov AA, Bol'sheva NL, Iurkevich OI, Lipin'sh DI, Mufazalov IA, Kuprash DV, Nedospasov SA. [Chromosomal localization and molecular organization of human genomic fragment containing TNF/LT locus in transgenic mice]. Mol Biol (Mosk) 2008; 42:629-38. [PMID: 18856063 DOI: 10.1134/s0026893308040201] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/23/2022]
Abstract
The molecular organization, copy number and chromosomal localization of a human TNF/LT locus fragment were determined in the genomes of two transgenic mouse lines. The genome of the first line contains two copies, organized in a head-to-tail manner and located on the eighth chromosome by karyotyping; the single transgene copy of the second line is observed on the fifth chromosome. These mice could serve as a valuable model for studying the physiological functions of both human tumor necrosis factor and lymphotoxin.
|
40
|
Frenkel FE, Korotkov EV. Classification analysis of triplet periodicity in protein-coding regions of genes. Gene 2008; 421:52-60. [PMID: 18593596 DOI: 10.1016/j.gene.2008.06.012] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Received: 03/04/2008] [Revised: 05/14/2008] [Accepted: 06/06/2008] [Indexed: 11/16/2022]
Abstract
We introduce a new concept of triplet periodicity class (TPC) and a measure of similarity between such classes. We classified 472288 triplet periodicity (TP) regions found in 578868 genes from the 29th release of the KEGG databank. In total, 2520 classes were obtained. They contain 94% of the 472288 TP cases found. For 92% of the TP regions contained in classes, the same linkage of TP to the open reading frame (ORF) is observed. For 8% of TP cases we revealed a shift between the ORF of a gene and the ORF common to the majority of genes contained in a TPC. For these 8% of periodic regions, hypothetical amino acid sequences corresponding to the ORF built by the TPC were generated. The BLAST program showed that 2679 hypothetical amino acid sequences have statistically significant similarity with proteins from the UniProt databank. We suppose that the 8% of TP regions contained in classes possess a mutation originating from an ORF shift. The TPCs obtained can be used for identifying the coding regions of genes as well as for searching for mutations arising from ORF shifts.
Affiliation(s)
- F E Frenkel
- Bioengineering Centre of RAS, Moscow, Russia.
|
41
|
|
42
|
Melodelima C, Gautier C, Piau D. A markovian approach for the prediction of mouse isochores. J Math Biol 2007; 55:353-64. [PMID: 17486342 DOI: 10.1007/s00285-007-0087-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/27/2006] [Revised: 03/01/2007] [Indexed: 10/23/2022]
Abstract
Hidden Markov models (HMMs) are effective tools to detect series of statistically homogeneous structures, but they are not well suited to analyse complex structures. For example, the duration of stay in a state of an HMM must follow a geometric law. Numerous other methodological difficulties are encountered when using HMMs to segregate genes from transposons or retroviruses, or to determine the isochore classes of genes. The aim of this paper is to analyse these methodological difficulties, and to suggest new tools for the exploration of genome data. We show that HMMs can be used to analyse complex gene structures with bell-shaped length distributions by using convolutions of geometric distributions. Thus, we have introduced macro-states to model the distributions of the lengths of the regions. Our study shows that a simple HMM could be used to model the isochore organisation of the mouse genome. This potential use of markovian models to help in data exploration has been underestimated until now.
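The point about geometric durations can be made concrete numerically. A minimal sketch, assuming a macro-state built as a chain of identical sub-states with the same exit probability (the authors' actual construction may differ): the length distribution of the chain is the convolution of the sub-state geometric distributions, and its mode moves away from the minimum length, giving the bell shape the abstract describes. Function names and parameter values are hypothetical.

```python
import numpy as np

def geometric_pmf(p_exit: float, max_len: int) -> np.ndarray:
    """P(duration = d) for d = 0..max_len in one HMM state; durations start at 1."""
    d = np.arange(max_len + 1, dtype=float)
    return np.where(d >= 1, p_exit * (1 - p_exit) ** (d - 1), 0.0)

def macro_state_pmf(p_exit: float, n_substates: int, max_len: int) -> np.ndarray:
    """Length distribution of a chain of identical sub-states (assumed layout).

    The total duration is a sum of independent geometric durations, so its
    distribution is the repeated convolution of the single-state distribution.
    """
    single = geometric_pmf(p_exit, max_len)
    total = single.copy()
    for _ in range(n_substates - 1):
        # Truncating to max_len keeps every retained entry exact.
        total = np.convolve(total, single)[: max_len + 1]
    return total

one_state = macro_state_pmf(0.1, 1, 500)
five_states = macro_state_pmf(0.1, 5, 500)
print("mode with 1 sub-state: ", int(np.argmax(one_state)))
print("mode with 5 sub-states:", int(np.argmax(five_states)))
```

A single state always has its most probable duration at the shortest possible length, whereas the five-sub-state chain peaks at an interior length.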
Affiliation(s)
- Christelle Melodelima
- UMR 5558 CNRS Biométrie et Biologie Evolutive, Université Claude Bernard Lyon 1, 43 boulevard du 11 Novembre 1918, 69622 Villeurbanne Cedex, France.
|
43
|
Keibler E, Arumugam M, Brent MR. The Treeterbi and Parallel Treeterbi algorithms: efficient, optimal decoding for ordinary, generalized and pair HMMs. Bioinformatics 2007; 23:545-54. [PMID: 17237054 DOI: 10.1093/bioinformatics/btl659] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Hidden Markov models (HMMs) and generalized HMMs have been successfully applied to many problems, but the standard Viterbi algorithm for computing the most probable interpretation of an input sequence (known as decoding) requires memory proportional to the length of the sequence, which can be prohibitive. Existing approaches to reducing memory usage either sacrifice optimality or trade increased running time for reduced memory. RESULTS We developed two novel decoding algorithms, Treeterbi and Parallel Treeterbi, and implemented them in the TWINSCAN/N-SCAN gene-prediction system. The worst case asymptotic space and time are the same as for standard Viterbi, but in practice, Treeterbi optimally decodes arbitrarily long sequences with generalized HMMs in bounded memory without increasing running time. Parallel Treeterbi uses the same ideas to split optimal decoding across processors, dividing latency to completion by approximately the number of available processors with constant average overhead per processor. Using these algorithms, we were able to optimally decode all human chromosomes with N-SCAN, which increased its accuracy relative to heuristic solutions. We also implemented Treeterbi for Pairagon, our pair HMM based cDNA-to-genome aligner. AVAILABILITY The TWINSCAN/N-SCAN/PAIRAGON open source software package is available from http://genes.cse.wustl.edu.
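For reference, the memory cost the abstract attributes to standard Viterbi decoding comes from the back-pointer table, which holds one row per sequence position. The sketch below is the textbook algorithm for an ordinary HMM in log space, not Treeterbi; the function name and the toy two-state model in the usage lines are hypothetical.

```python
import numpy as np

def viterbi(obs, log_start, log_trans, log_emit):
    """Most probable state path of an ordinary HMM (standard Viterbi, log space).

    obs: sequence of integer symbols; log_start: (S,), log_trans: (S, S) with
    rows as source states; log_emit: (S, n_symbols).
    """
    n_states = log_start.shape[0]
    n_obs = len(obs)
    score = log_start + log_emit[:, obs[0]]
    # One back-pointer row per position: memory grows linearly with the input.
    backptr = np.zeros((n_obs, n_states), dtype=int)
    for t in range(1, n_obs):
        # cand[i, j]: best score of a path ending in state i, then moving to j.
        cand = score[:, None] + log_trans
        backptr[t] = np.argmax(cand, axis=0)
        score = cand[backptr[t], np.arange(n_states)] + log_emit[:, obs[t]]
    # Trace back from the best final state.
    path = [int(np.argmax(score))]
    for t in range(n_obs - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Toy two-state model with sticky transitions (illustrative values only).
log_start = np.log(np.array([0.5, 0.5]))
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_emit = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
print(viterbi([0, 0, 0, 1, 1, 1], log_start, log_trans, log_emit))
```

The `backptr` array is the part whose size scales with sequence length, which is the cost that becomes prohibitive on chromosome-scale inputs.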
Affiliation(s)
- Evan Keibler
- Laboratory for Computational Genomics, Campus Box 1045, Washington University, St. Louis, MO 63130, USA
|
44
|
Segovia-Juarez JL, Colombano S, Kirschner D. Identifying DNA splice sites using hypernetworks with artificial molecular evolution. Biosystems 2006; 87:117-24. [PMID: 17116361 DOI: 10.1016/j.biosystems.2006.09.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 02/28/2005] [Revised: 07/08/2006] [Accepted: 07/15/2006] [Indexed: 11/28/2022]
Abstract
Identifying DNA splice sites is a main task of gene hunting. We introduce the hypernetwork architecture as a novel method for finding DNA splice sites. The hypernetwork architecture is a biologically inspired information processing system composed of networks of molecules forming cells, and a number of cells forming a tissue or organism. Its learning is based on molecular evolution. DNA examples taken from GenBank were translated into binary strings and fed into a hypernetwork for training. We performed experiments to explore the generalization performance of hypernetwork learning in this data set by two-fold cross validation. The hypernetwork generalization performance was comparable to well known classification algorithms. With the best hypernetwork obtained, including local information and heuristic rules, we built a system (HyperExon) to obtain splice site candidates. The HyperExon system outperformed leading splice recognition systems in the list of sequences tested.
Affiliation(s)
- Jose L Segovia-Juarez
- Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
|
45
|
Abstract
This article introduces the field of bioinformatics and describes bioinformatic approaches and their application to the study of protein allergens. The predominant bioinformatics tools and resources are listed and discussed.
Affiliation(s)
- Pinar Kondu Akalin
- Iontek, Meridyen Is Merkezi Ali Riza Gurcan Cad. Cirpici Yolu, Istanbul 34010, Turkey.
|
46
|
|
47
|
Dutta S, Singhal P, Agrawal P, Tomer R, Kritee K, Khurana E, Jayaram B. A physicochemical model for analyzing DNA sequences. J Chem Inf Model 2006; 46:78-85. [PMID: 16426042 DOI: 10.1021/ci050119x] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
In search of an ab initio model to characterize DNA sequences as genes and nongenes, we examined some physicochemical properties of each trinucleotide (codon), which could accomplish this task. We constructed three-dimensional vectors for each double-helical trinucleotide sequence considering hydrogen-bonding energy, stacking energy, and a third parameter, which we provisionally identified with DNA-protein interactions. As this three-dimensional vector moves along any genome, the net orientation of the resultant vector should differ significantly for gene and nongene regions to make a distinction feasible, if the underlying model has some merits. An analysis of 331 prokaryotic genomes comprising a total of 294 786 experimentally verified genes (nonoverlapping) and an equal number of nongenes presents a proof of concept of the model without the need for further parametrization. Also, initial analyses on Saccharomyces cerevisiae and Arabidopsis thaliana suggest that the methodology is extendable to eukaryotes. The physicochemical model (ChemGenome1.0) introduced has the potential to be developed into a gene-finding algorithm and, more pressingly, could be employed for an independent assessment of the annotation of DNA sequences.
Affiliation(s)
- Samrat Dutta
- Department of Chemistry and Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi
|
48
|
Huang J, Li T, Chen K, Wu J. An approach of encoding for prediction of splice sites using SVM. Biochimie 2006; 88:923-9. [PMID: 16626852 DOI: 10.1016/j.biochi.2006.03.006] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2004] [Revised: 03/06/2006] [Accepted: 03/09/2006] [Indexed: 11/18/2022]
Abstract
In splice site prediction, accuracy remains below 90% even though the sequences adjacent to splice sites are highly conserved. To improve prediction accuracy, much attention has been paid to improving the performance of the algorithms used, and little to the more fundamental issue of nucleotide encoding. In this paper, a predictor based on support vector machines (SVM) is constructed to predict true and false splice sites for higher eukaryotes. Four types of encoding, namely mono-nucleotide (MN) encoding, MN with frequency difference between the true sites and false sites (FDTF) encoding, pair-wise nucleotides (PN) encoding and PN with FDTF encoding, were applied to generate the input for the SVM. The results showed that PN with FDTF encoding as input to the SVM led to the most reliable recognition of splice sites: the accuracies for the prediction of true donor sites and false sites were 96.3% and 93.7%, respectively, and the accuracies for the prediction of true acceptor sites and false sites were 94.0% and 93.2%, respectively.
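To make the difference between the mono-nucleotide and pair-wise encodings named in the abstract above more concrete, here is a minimal sketch. It assumes that MN encoding is a 4-value one-hot vector per base and that PN encoding is a 16-value one-hot vector per adjacent pair of bases; the abstract does not give the exact definitions, so these are illustrative assumptions, and the FDTF weighting is not implemented. The function names `mn_encode` and `pn_encode` are hypothetical.

```python
from itertools import product

BASES = "ACGT"
# All 16 ordered dinucleotides: AA, AC, AG, AT, CA, ...
PAIRS = ["".join(p) for p in product(BASES, repeat=2)]


def mn_encode(seq):
    """One-hot encode each nucleotide (assumed MN form): 4 values per base."""
    vec = []
    for base in seq.upper():
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec


def pn_encode(seq):
    """One-hot encode each adjacent pair (assumed PN form): 16 values per pair."""
    seq = seq.upper()
    vec = []
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        vec.extend(1 if pair == p else 0 for p in PAIRS)
    return vec


# Hypothetical window of bases around a candidate donor site.
window = "GTAAG"
mn_features = mn_encode(window)
pn_features = pn_encode(window)
print(len(mn_features), len(pn_features))
```

Under these assumptions a window of n bases yields 4n MN features and 16(n - 1) PN features, and any symbol outside A, C, G, T encodes as all zeros. Either vector could then be passed to an SVM implementation as a feature row.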
Affiliation(s)
- J Huang
- Department of Chemistry, Tongji University, Shanghai, China
|
49
|
Sczyrba A, Beckstette M, Brivanlou AH, Giegerich R, Altmann CR. XenDB: full length cDNA prediction and cross species mapping in Xenopus laevis. BMC Genomics 2005; 6:123. [PMID: 16162280 PMCID: PMC1261260 DOI: 10.1186/1471-2164-6-123] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2005] [Accepted: 09/14/2005] [Indexed: 11/23/2022] Open
Abstract
Background: Research using the model system Xenopus laevis has provided critical insights into the mechanisms of early vertebrate development and cell biology. Large-scale sequencing efforts have provided an increasingly important resource for researchers. To take full advantage of the available sequence, we have analyzed 350,468 Xenopus laevis Expressed Sequence Tags (ESTs) both to identify full-length protein-encoding sequences and to develop a unique database system to support comparative approaches between X. laevis and other model systems. Description: Using a suffix array based clustering approach, we have identified 25,971 clusters and 40,877 singleton sequences. Generation of a consensus sequence for each cluster resulted in 31,353 tentative contig and 4,801 singleton sequences. Using both BLASTX and FASTY comparison to five model organisms and the NR protein database, more than 15,000 sequences are predicted to encode full-length proteins, and these have been matched to publicly available IMAGE clones when available. Each sequence has been compared to the KOG database, and ~67% of the sequences have been assigned a putative functional category. Based on sequence homology to mouse and human, putative GO annotations have been determined. Conclusion: The results of the analysis have been stored in a publicly available database, XenDB. A unique capability of the database is the ability to batch upload cross-species queries to identify potential Xenopus homologues and their associated full-length clones. Examples are provided, including mapping of microarray results and application of 'in silico' analysis. The ability to quickly translate the results of various species into 'Xenopus-centric' information should greatly enhance comparative embryological approaches. Supplementary material can be found at .
Affiliation(s)
- Alexander Sczyrba
- AG Praktische Informatik, Technische Fakultät, Universität Bielefeld, D-33594 Bielefeld, Germany
- Michael Beckstette
- AG Praktische Informatik, Technische Fakultät, Universität Bielefeld, D-33594 Bielefeld, Germany
- Ali H Brivanlou
- The Rockefeller University, Laboratory of Molecular Vertebrate Embryology, 1230 York Avenue, New York, NY 10021, USA
- Robert Giegerich
- AG Praktische Informatik, Technische Fakultät, Universität Bielefeld, D-33594 Bielefeld, Germany
- Curtis R Altmann
- FSU College of Medicine, Department of Biomedical Sciences, 1269 W. Call Street, Tallahassee, FL 32306, USA
|
50
|
A neural network based multi-classifier system for gene identification in DNA sequences. Neural Comput Appl 2004. [DOI: 10.1007/s00521-004-0447-7] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
|