1. Chen H, Liu Y, Balabani S, Hirayama R, Huang J. Machine Learning in Predicting Printable Biomaterial Formulations for Direct Ink Writing. Research (Washington, D.C.) 2023; 6:0197. [PMID: 37469394 PMCID: PMC10353544 DOI: 10.34133/research.0197] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 04/06/2023] [Accepted: 06/29/2023] [Indexed: 07/21/2023]
Abstract
Three-dimensional (3D) printing is emerging as a transformative technology for biomedical engineering. The 3D printed product can be patient-specific by allowing customizability and direct control of the architecture. The trial-and-error approach currently used for developing the composition of printable inks is time- and resource-consuming due to the increasing number of variables requiring expert knowledge. Artificial intelligence has the potential to reshape the ink development process by forming a predictive model for printability from experimental data. In this paper, we constructed machine learning (ML) algorithms including decision tree, random forest (RF), and deep learning (DL) to predict the printability of biomaterials. A total of 210 formulations including 16 different bioactive and smart materials and 4 solvents were 3D printed, and their printability was assessed. All ML methods were able to learn and predict the printability of a variety of inks based on their biomaterial formulations. In particular, the RF algorithm has achieved the highest accuracy (88.1%), precision (90.6%), and F1 score (87.0%), indicating the best overall performance out of the 3 algorithms, while DL has the highest recall (87.3%). Furthermore, the ML algorithms have predicted the printability window of biomaterials to guide the ink development. The printability map generated with DL has finer granularity than other algorithms. ML has proven to be an effective and novel strategy for developing biomaterial formulations with desired 3D printability for biomedical engineering applications.
Affiliation(s)
- Hongyi Chen
  - Department of Mechanical Engineering, University College London, London, UK
  - Department of Computer Science, University College London, London, UK
- Yuanchang Liu
  - Department of Mechanical Engineering, University College London, London, UK
- Stavroula Balabani
  - Department of Mechanical Engineering, University College London, London, UK
  - Wellcome-EPSRC Centre for Interventional Surgical Sciences (WEISS), University College London, London, UK
- Ryuji Hirayama
  - Department of Computer Science, University College London, London, UK
- Jie Huang
  - Department of Mechanical Engineering, University College London, London, UK
2. Wang Y, Cai X, Hu S, Qin S, Wang Z, Cao Y, Hou C, Yang J, Zhou W. Comparative genomic analysis provides insight into the phylogeny and potential mechanisms of adaptive evolution of Sphingobacterium sp. CZ-2. Gene 2023; 855:147118. [PMID: 36521669 DOI: 10.1016/j.gene.2022.147118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2022] [Revised: 11/21/2022] [Accepted: 12/09/2022] [Indexed: 12/14/2022]
Abstract
Sphingobacterium is a class of Gram-negative, non-fermentative bacilli that have received widespread attention due to their broad ecological distribution and oil degradation ability, but are rarely involved in infections. In this manuscript, a novel Sphingobacterium strain isolated from wildfire-infected tobacco leaves was named Sphingobacterium sp. CZ-2. NGS and TGS sequencing results showed a whole genome of 3.92 Mb with 40.68 mol% GC content, containing 3,462 protein-coding genes, 9 rRNA-coding genes and 50 tRNA-coding genes. Phylogenetic analysis, ANI and dDDH calculations all supported that Sphingobacterium sp. CZ-2 represented a novel species of the genus Sphingobacterium. Analysis of the specific genes of Sphingobacterium sp. CZ-2 by comparative genomics revealed that metal transport proteins encoded by the troD and cusA genes could maintain the balance of heavy metal ion concentrations in the internal environment of bacteria and avoid heavy metal toxicity while meeting the needs of growth and reproduction, and transport proteins encoded by the malG gene could keep nutrients required for the survival of bacteria. Synteny and genome evolutionary analyses of Sphingobacterium strains implicated gene family contraction as a major process in genome evolution, with insertional sequences leading to mutations, deletions and reversals of genes that help bacteria to withstand complex environmental changes. Complete genome sequencing and systematic comparative genomic analysis will contribute new insights into the adaptive evolution of this novel species and the genus Sphingobacterium.
Affiliation(s)
- Yongqiang Wang
  - Hunan Provincial Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha 410128, China
  - Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha 410128, China
- Xunhui Cai
  - School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China
- Shengnan Hu
  - Hunan Provincial Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha 410128, China
  - Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha 410128, China
- Sidong Qin
  - Hunan Provincial Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha 410128, China
  - Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha 410128, China
- Ziqi Wang
  - Hunan Provincial Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha 410128, China
  - Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha 410128, China
- Yixiang Cao
  - Hunan Provincial Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha 410128, China
  - Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha 410128, China
- Chaoliang Hou
  - Hunan Provincial Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha 410128, China
  - Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha 410128, China
- Jiangshan Yang
  - Hunan Provincial Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha 410128, China
  - Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha 410128, China
- Wei Zhou
  - Hunan Provincial Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha 410128, China
  - Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha 410128, China
3. Patra P, B R D, Kundu P, Das M, Ghosh A. Recent advances in machine learning applications in metabolic engineering. Biotechnol Adv 2023; 62:108069. [PMID: 36442697 DOI: 10.1016/j.biotechadv.2022.108069] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 07/17/2022] [Revised: 10/18/2022] [Accepted: 11/22/2022] [Indexed: 11/27/2022]
Abstract
Metabolic engineering encompasses several widely used strategies, which currently hold a high seat in the field of biotechnology when its potential is manifesting through a plethora of research and commercial products with a strong societal impact. The genomic revolution that occurred almost three decades ago has initiated the generation of large omics datasets, which has helped in gaining a better understanding of cellular behavior. The itinerary of metabolic engineering that has occurred based on these large datasets has allowed researchers to gain detailed insights and a reasonable understanding of the intricacies of biosystems. However, the existing trial-and-error approaches for metabolic engineering are laborious and time-intensive when it comes to the production of target compounds with high yields through genetic manipulations in host organisms. Machine learning (ML) coupled with the available metabolic engineering test instances and omics data brings a comprehensive and multidisciplinary approach that enables scientists to evaluate various parameters for effective strain design. This vast amount of biological data should be standardized through knowledge engineering to train different ML models for providing accurate predictions in gene circuit design, modification of proteins, optimization of bioprocess parameters for scaling up, and screening of hyper-producing robust cell factories. This review briefs on the premise of ML, followed by mentioning various ML methods and algorithms alongside the numerous omics datasets available to train ML models for predicting metabolic outcomes with high accuracy. The combinative interplay between the ML algorithms and biological datasets through knowledge engineering has guided the recent advancements in applications such as CRISPR/Cas systems, gene circuits, protein engineering, metabolic pathway reconstruction, and bioprocess engineering. Finally, this review addresses the probable challenges of applying ML in metabolic engineering which will guide the researchers toward novel techniques to overcome the limitations.
Affiliation(s)
- Pradipta Patra
  - School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Disha B R
  - B.M.S College of Engineering, Basavanagudi, Bengaluru, Karnataka 560019, India
- Pritam Kundu
  - School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Manali Das
  - School of Bioscience, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Amit Ghosh
  - School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
  - P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India
4. Karanth S, Tanui CK, Meng J, Pradhan AK. Exploring the predictive capability of advanced machine learning in identifying severe disease phenotype in Salmonella enterica. Food Res Int 2022; 151:110817. [PMID: 34980422 DOI: 10.1016/j.foodres.2021.110817] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 09/17/2021] [Revised: 11/12/2021] [Accepted: 11/17/2021] [Indexed: 11/26/2022]
Abstract
The past few years have seen a significant increase in availability of whole genome sequencing information, allowing for its incorporation in predictive modeling for foodborne pathogens to account for inter- and intra-species differences in their virulence. However, this is hindered by the inability of traditional statistical methods to analyze such large amounts of data compared to the number of observations/isolates. In this study, we have explored the applicability of machine learning (ML) models to predict the disease outcome, while identifying features that exert a significant effect on the prediction. This study was conducted on Salmonella enterica, a major foodborne pathogen with considerable inter- and intra-serovar variation. WGS of isolates obtained from various sources (i.e., human, chicken, and swine) were used as input in four machine learning models (logistic regression with ridge, random forest, support vector machine, and AdaBoost) to classify isolates based on disease severity (extraintestinal vs. gastrointestinal) in the host. The predictive performances of all models were tested with and without Elastic Net regularization to combat dimensionality issues. Elastic Net-regularized logistic regression model showed the best area under the receiver operating characteristic curve (AUC-ROC; 0.86) and outcome prediction accuracy (0.76). Additionally, genes coding for transcriptional regulation, acidic, oxidative, and anaerobic stress response, and antibiotic resistance were found to be significant predictors of disease severity. These genes, which were significantly associated with each outcome, could possibly be input in amended, gene-expression-specific predictive models to estimate virulence pattern-specific effect of Salmonella and other foodborne pathogens on human health.
Affiliation(s)
- Shraddha Karanth
  - Department of Nutrition and Food Science, University of Maryland, College Park, MD 20742, USA
- Collins K Tanui
  - Department of Nutrition and Food Science, University of Maryland, College Park, MD 20742, USA
  - Center for Food Safety and Security Systems, University of Maryland, College Park, MD 20742, USA
- Jianghong Meng
  - Department of Nutrition and Food Science, University of Maryland, College Park, MD 20742, USA
  - Center for Food Safety and Security Systems, University of Maryland, College Park, MD 20742, USA
  - Joint Institute for Food Safety and Applied Nutrition, University of Maryland, College Park, MD 20742, USA
- Abani K Pradhan
  - Department of Nutrition and Food Science, University of Maryland, College Park, MD 20742, USA
  - Center for Food Safety and Security Systems, University of Maryland, College Park, MD 20742, USA
5. Ejigu GF, Yi G, Kim JI, Jung J. ReGSP: a visualized application for homology-based gene searching and plotting using multiple reference sequences. PeerJ 2021; 9:e12707. [PMID: 35036172 PMCID: PMC8710255 DOI: 10.7717/peerj.12707] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/07/2021] [Accepted: 12/07/2021] [Indexed: 12/17/2022] [Open Access]
Abstract
The massively parallel nature of next-generation sequencing technologies has contributed to the generation of massive sequence data in the last two decades. Deciphering the meaning of each generated sequence requires multiple analysis tools, at all stages of analysis, from the reads stage all the way up to the whole-genome level. Homology-based approaches based on related reference sequences are usually the preferred option for gene and transcript prediction in newly sequenced genomes, resulting in the popularity of a variety of BLAST and BLAST-based tools. For organelle genomes, a single-reference-based gene finding tool that uses grouping parameters for BLAST results has been implemented in the Genome Search Plotter (GSP). However, this tool does not accept multiple and user-customized reference sequences required for a broad homology search. Here, we present multiple Reference-based Gene Search and Plot (ReGSP), a simple and convenient web tool that accepts multiple reference sequences for homology-based gene search. The tool incorporates cPlot, a novel dot plot tool, for illustrating nucleotide sequence similarity between the query and the reference sequences. ReGSP has an easy-to-use web interface and is freely accessible at https://ds.mju.ac.kr/regsp.
Affiliation(s)
- Girum Fitihamlak Ejigu
  - Department of Information and Communication Engineering, Myongji University, Yongin-si, Gyeonggi-do, South Korea
- Gangman Yi
  - Department of Multimedia Engineering, Dongguk University, Seoul, South Korea
- Jong Im Kim
  - Department of Biology, Chungnam National University, Daejeon, South Korea
- Jaehee Jung
  - Department of Information and Communication Engineering, Myongji University, Yongin-si, Gyeonggi-do, South Korea
6. Wang Q, Kille B, Liu TR, Elworth RAL, Treangen TJ. PlasmidHawk improves lab of origin prediction of engineered plasmids using sequence alignment. Nat Commun 2021; 12:1167. [PMID: 33637701 PMCID: PMC7910462 DOI: 10.1038/s41467-021-21180-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 05/22/2020] [Accepted: 01/12/2021] [Indexed: 12/26/2022] [Open Access]
Abstract
With advances in synthetic biology and genome engineering comes a heightened awareness of potential misuse related to biosafety concerns. A recent study employed machine learning to identify the lab-of-origin of DNA sequences to help mitigate some of these concerns. Despite their promising results, this deep learning based approach had limited accuracy, was computationally expensive to train, and wasn't able to provide the precise features that were used in its predictions. To address these shortcomings, we developed PlasmidHawk for lab-of-origin prediction. Compared to a machine learning approach, PlasmidHawk has higher prediction accuracy; PlasmidHawk can successfully predict unknown sequences' depositing labs 76% of the time and 85% of the time the correct lab is in the top 10 candidates. In addition, PlasmidHawk can precisely single out the signature sub-sequences that are responsible for the lab-of-origin detection. In summary, PlasmidHawk represents an explainable and accurate tool for lab-of-origin prediction of synthetic plasmid sequences. PlasmidHawk is available at https://gitlab.com/treangenlab/plasmidhawk.git .
Affiliation(s)
- Qi Wang
  - Systems, Synthetic, and Physical Biology (SSPB) Graduate Program, Rice University, Houston, Texas, 77005, USA
- Bryce Kille
  - Department of Computer Science, Rice University, Houston, Texas, 77005, United States
- Tian Rui Liu
  - Department of Computer Science, Rice University, Houston, Texas, 77005, United States
- R A Leo Elworth
  - Department of Computer Science, Rice University, Houston, Texas, 77005, United States
- Todd J Treangen
  - Department of Computer Science, Rice University, Houston, Texas, 77005, United States
7. Coding Exon-Structure Aware Realigner (CESAR): Utilizing Genome Alignments for Comparative Gene Annotation. Methods Mol Biol 2019; 1962:179-191. [PMID: 31020560 DOI: 10.1007/978-1-4939-9173-0_10] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/17/2023]
Abstract
Alignment-based gene identification methods utilize sequence conservation between orthologous protein-coding genes to annotate genes in newly sequenced genomes. CESAR is an approach that makes use of existing genome alignments to transfer genes from one genome to other aligned genomes, and thus generates comparative gene annotations. To accurately detect conserved exons that exhibit an intact reading frame and consensus splice sites, CESAR produces a new alignment between orthologous exons, taking information about the exon's reading frame and splice site positions into account. Furthermore, CESAR is able to detect most evolutionary splice site shifts, which helps to annotate exon boundaries at high precision. Here, we describe how to apply CESAR to generate comparative gene annotations for one or many species, and discuss the strengths and limitations of this approach. CESAR is available at https://github.com/hillerlab/CESAR2.0 .
8.
Abstract
Next Generation Sequencing (NGS) or deep sequencing technology enables parallel reading of multiple individual DNA fragments, thereby enabling the identification of millions of base pairs in several hours. Recent research has clearly shown that machine learning technologies can efficiently analyse large sets of genomic data and help to identify novel gene functions and regulation regions. A deep artificial neural network consists of a group of artificial neurons that mimic the properties of living neurons. These mathematical models, termed Artificial Neural Networks (ANN), can be used to solve artificial intelligence engineering problems in several different technological fields (e.g., biology, genomics, proteomics, and metabolomics). In practical terms, neural networks are non-linear statistical structures that are organized as modelling tools and are used to simulate complex genomic relationships between inputs and outputs. To date, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNN) have been demonstrated to be the best tools for improving performance in problem solving tasks within the genomic field.
9. Sharma V, Hiller M. Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Res 2017. [PMID: 28645144 PMCID: PMC5737078 DOI: 10.1093/nar/gkx554] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Indexed: 01/23/2023] [Open Access]
Abstract
Genome alignments provide a powerful basis to transfer gene annotations from a well-annotated reference genome to many other aligned genomes. The completeness of these annotations crucially depends on the sensitivity of the underlying genome alignment. Here, we investigated the impact of the genome alignment parameters and found that parameters with a higher sensitivity allow the detection of thousands of novel alignments between orthologous exons that have been missed before. In particular, comparisons between species separated by an evolutionary distance of >0.75 substitutions per neutral site, like human and other non-placental vertebrates, benefit from increased sensitivity. To systematically test if increased sensitivity improves comparative gene annotations, we built a multiple alignment of 144 vertebrate genomes and used this alignment to map human genes to the other 143 vertebrates with CESAR. We found that higher alignment sensitivity substantially improves the completeness of comparative gene annotations by adding on average 2382 and 7440 novel exons and 117 and 317 novel genes for mammalian and non-mammalian species, respectively. Our results suggest a more sensitive alignment strategy that should generally be used for genome alignments between distantly-related species. Our 144-vertebrate genome alignment and the comparative gene annotations (https://bds.mpi-cbg.de/hillerlab/144VertebrateAlignment_CESAR/) are a valuable resource for comparative genomics.
Affiliation(s)
- Virag Sharma
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
- Michael Hiller
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
10. Making sense of genomes of parasitic worms: Tackling bioinformatic challenges. Biotechnol Adv 2016; 34:663-686. [DOI: 10.1016/j.biotechadv.2016.03.001] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 11/05/2015] [Revised: 02/25/2016] [Accepted: 03/01/2016] [Indexed: 01/25/2023]
11. Sharma V, Elghafari A, Hiller M. Coding exon-structure aware realigner (CESAR) utilizes genome alignments for accurate comparative gene annotation. Nucleic Acids Res 2016; 44:e103. [PMID: 27016733 PMCID: PMC4914097 DOI: 10.1093/nar/gkw210] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Received: 02/02/2016] [Revised: 03/04/2016] [Accepted: 03/18/2016] [Indexed: 12/03/2022] [Open Access]
Abstract
Identifying coding genes is an essential step in genome annotation. Here, we utilize existing whole genome alignments to detect conserved coding exons and then map gene annotations from one genome to many aligned genomes. We show that genome alignments contain thousands of spurious frameshifts and splice site mutations in exons that are truly conserved. To overcome these limitations, we have developed CESAR (Coding Exon-Structure Aware Realigner) that realigns coding exons, while considering reading frame and splice sites of each exon. CESAR effectively avoids spurious frameshifts in conserved genes and detects 91% of shifted splice sites. This results in the identification of thousands of additional conserved exons and 99% of the exons that lack inactivating mutations match real exons. Finally, to demonstrate the potential of using CESAR for comparative gene annotation, we applied it to 188 788 exons of 19 865 human genes to annotate human genes in 99 other vertebrates. These comparative gene annotations are available as a resource (http://bds.mpi-cbg.de/hillerlab/CESAR/). CESAR (https://github.com/hillerlab/CESAR/) can readily be applied to other alignments to accurately annotate coding genes in many other vertebrate and invertebrate genomes.
Affiliation(s)
- Virag Sharma
  - Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstr. 108, 01307 Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Nöthnitzer Str. 38, 01187 Dresden, Germany
- Anas Elghafari
  - Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstr. 108, 01307 Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Nöthnitzer Str. 38, 01187 Dresden, Germany
  - Technical University, 01069 Dresden, Germany
- Michael Hiller
  - Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstr. 108, 01307 Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Nöthnitzer Str. 38, 01187 Dresden, Germany
12.
Abstract
The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.
Affiliation(s)
- Maxwell W Libbrecht
  - Department of Computer Science and Engineering, University of Washington, 185 Stevens Way, Seattle, Washington 98195-2350, USA
- William Stafford Noble
  - Department of Computer Science and Engineering, University of Washington, 185 Stevens Way, Seattle, Washington 98195-2350, USA
  - Department of Genome Sciences, University of Washington, 3720 15th Ave NE Seattle, Washington 98195-5065, USA
13. Chen ZX, Sturgill D, Qu J, Jiang H, Park S, Boley N, Suzuki AM, Fletcher AR, Plachetzki DC, FitzGerald PC, Artieri CG, Atallah J, Barmina O, Brown JB, Blankenburg KP, Clough E, Dasgupta A, Gubbala S, Han Y, Jayaseelan JC, Kalra D, Kim YA, Kovar CL, Lee SL, Li M, Malley JD, Malone JH, Mathew T, Mattiuzzo NR, Munidasa M, Muzny DM, Ongeri F, Perales L, Przytycka TM, Pu LL, Robinson G, Thornton RL, Saada N, Scherer SE, Smith HE, Vinson C, Warner CB, Worley KC, Wu YQ, Zou X, Cherbas P, Kellis M, Eisen MB, Piano F, Kiontke K, Fitch DH, Sternberg PW, Cutter AD, Duff MO, Hoskins RA, Graveley BR, Gibbs RA, Bickel PJ, Kopp A, Carninci P, Celniker SE, Oliver B, Richards S. Comparative validation of the D. melanogaster modENCODE transcriptome annotation. Genome Res 2015; 24:1209-23. [PMID: 24985915 PMCID: PMC4079975 DOI: 10.1101/gr.159384.113] [Citation(s) in RCA: 111] [Impact Index Per Article: 11.1] [Indexed: 01/08/2023]
Abstract
Accurate gene model annotation of reference genomes is critical for making them useful. The modENCODE project has improved the D. melanogaster genome annotation by using deep and diverse high-throughput data. Since transcriptional activity that has been evolutionarily conserved is likely to have an advantageous function, we have performed large-scale interspecific comparisons to increase confidence in predicted annotations. To support comparative genomics, we filled in divergence gaps in the Drosophila phylogeny by generating draft genomes for eight new species. For comparative transcriptome analysis, we generated mRNA expression profiles on 81 samples from multiple tissues and developmental stages of 15 Drosophila species, and we performed cap analysis of gene expression in D. melanogaster and D. pseudoobscura. We also describe conservation of four distinct core promoter structures composed of combinations of elements at three positions. Overall, each type of genomic feature shows a characteristic divergence rate relative to neutral models, highlighting the value of multispecies alignment in annotating a target genome that should prove useful in the annotation of other high priority genomes, especially human and other mammalian genomes that are rich in noncoding sequences. We report that the vast majority of elements in the annotation are evolutionarily conserved, indicating that the annotation will be an important springboard for functional genetic testing by the Drosophila community.
Affiliation(s)
- Zhen-Xia Chen
  - National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- David Sturgill
  - National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Jiaxin Qu
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Huaiyang Jiang
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Soo Park
  - Department of Genome Dynamics, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
- Nathan Boley
  - Department of Statistics, University of California, Berkeley, California 94720, USA
- Ana Maria Suzuki
  - Technology Development Group, RIKEN Omics Science Center and RIKEN Center for Life Science Technologies, Division of Genomic Technologies, Yokohama City, Kanagawa, Japan 230-0045
- Anthony R Fletcher
  - Division of Computational Bioscience, Center For Information Technology, National Institutes of Health, Bethesda, Maryland 20814, USA
- David C Plachetzki
  - Department of Evolution and Ecology, University of California, Davis, California 95616, USA
- Peter C FitzGerald
  - National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Carlo G Artieri
  - National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Joel Atallah
  - Department of Evolution and Ecology, University of California, Davis, California 95616, USA
- Olga Barmina
  - Department of Evolution and Ecology, University of California, Davis, California 95616, USA
- James B Brown
  - Department of Statistics, University of California, Berkeley, California 94720, USA
- Kerstin P Blankenburg
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Emily Clough
  - National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Abhijit Dasgupta
  - Clinical Trials and Outcomes Branch, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Sai Gubbala
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Yi Han
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Joy C Jayaseelan
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Divya Kalra
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Yoo-Ah Kim
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20892, USA
- Christie L Kovar
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Sandra L Lee
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Mingmei Li
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- James D Malley
  - Division of Computational Bioscience, Center For Information Technology, National Institutes of Health, Bethesda, Maryland 20814, USA
| | - John H Malone
- National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Tittu Mathew
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Nicolas R Mattiuzzo
- National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Mala Munidasa
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Donna M Muzny
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Fiona Ongeri
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Lora Perales
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Teresa M Przytycka
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Ling-Ling Pu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Garrett Robinson
- Department of Statistics, University of California, Berkeley, California 94720, USA
| | - Rebecca L Thornton
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Nehad Saada
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Steven E Scherer
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Harold E Smith
- National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Charles Vinson
- National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Crystal B Warner
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Kim C Worley
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Yuan-Qing Wu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Xiaoyan Zou
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Peter Cherbas
- Department of Biology, Indiana University, Bloomington, Indiana 47405, USA
| | - Manolis Kellis
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Michael B Eisen
- Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
| | - Fabio Piano
- Department of Biology, New York University, New York, New York 10003, USA
| | - Karin Kiontke
- Department of Biology, New York University, New York, New York 10003, USA
| | - David H Fitch
- Department of Biology, New York University, New York, New York 10003, USA
| | - Paul W Sternberg
- HHMI and Division of Biology, California Institute of Technology, Pasadena, California 91125, USA
| | - Asher D Cutter
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, M5S 3B2, Canada
| | - Michael O Duff
- Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, Farmington, Connecticut 06030-6403, USA
| | - Roger A Hoskins
- Department of Genome Dynamics, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
| | - Brenton R Graveley
- Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, Farmington, Connecticut 06030-6403, USA
| | - Richard A Gibbs
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
| | - Peter J Bickel
- Department of Statistics, University of California, Berkeley, California 94720, USA
| | - Artyom Kopp
- Department of Evolution and Ecology, University of California, Davis, California 95616, USA
| | - Piero Carninci
- Technology Development Group, RIKEN Omics Science Center and RIKEN Center for Life Science Technologies, Division of Genomic Technologies, Yokohama City, Kanagawa, Japan 230-0045
| | - Susan E Celniker
- Department of Genome Dynamics, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
| | - Brian Oliver
- National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Stephen Richards
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
|
14
|
Curran DM, Gilleard JS, Wasmuth JD. Figmop: a profile HMM to identify genes and bypass troublesome gene models in draft genomes. Bioinformatics 2014; 30:3266-7. [PMID: 25115706 DOI: 10.1093/bioinformatics/btu544] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]
Abstract
MOTIVATION Gene models from draft genome assemblies of metazoan species are often incorrect, missing exons or entire genes, particularly for large gene families. Consequently, labour-intensive manual curation is often necessary. We present Figmop (Finding Genes using Motif Patterns) to help with the manual curation of gene families in draft genome assemblies. The program uses a pattern of short sequence motifs to identify putative genes directly from the genome sequence. Using a large gene family as a test case, Figmop was found to be more sensitive and specific than a BLAST-based approach. The visualization used allows the validation of potential genes to be carried out quickly and easily, saving hours if not days from an analysis. AVAILABILITY AND IMPLEMENTATION Source code of Figmop is freely available for download at https://github.com/dave-the-scientist, implemented in C and Python and is supported on Linux, Unix and MacOSX. CONTACT curran.dave.m@gmail.com SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- David M Curran
- Department of Ecosystem and Public Health and Department of Comparative Biology and Experimental Medicine, Faculty of Veterinary Medicine, University of Calgary, Calgary, Alberta, T2N 4Z6, Canada
| | - John S Gilleard
- Department of Ecosystem and Public Health and Department of Comparative Biology and Experimental Medicine, Faculty of Veterinary Medicine, University of Calgary, Calgary, Alberta, T2N 4Z6, Canada
| | - James D Wasmuth
- Department of Ecosystem and Public Health and Department of Comparative Biology and Experimental Medicine, Faculty of Veterinary Medicine, University of Calgary, Calgary, Alberta, T2N 4Z6, Canada
|
15
|
Robert C, Fuentes-Utrilla P, Troup K, Loecherbach J, Turner F, Talbot R, Archibald AL, Mileham A, Deeb N, Hume DA, Watson M. Design and development of exome capture sequencing for the domestic pig (Sus scrofa). BMC Genomics 2014; 15:550. [PMID: 24988888 PMCID: PMC4099480 DOI: 10.1186/1471-2164-15-550] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 02/13/2014] [Accepted: 06/19/2014] [Indexed: 12/30/2022] Open
Abstract
Background The domestic pig (Sus scrofa) is both an important livestock species and a model for biomedical research. Exome sequencing has accelerated identification of protein-coding variants underlying phenotypic traits in human and mouse. We aimed to develop and validate a similar resource for the pig. Results We developed probe sets to capture pig exonic sequences based upon the current Ensembl pig gene annotation supplemented with mapped expressed sequence tags (ESTs) and demonstrated proof-of-principle capture and sequencing of the pig exome in 96 pigs, encompassing 24 capture experiments. For most of the samples at least 10x sequence coverage was achieved for more than 90% of the target bases. Bioinformatic analysis of the data revealed over 236,000 high confidence predicted SNPs and over 28,000 predicted indels. Conclusions We have achieved coverage statistics similar to those seen with commercially available human and mouse exome kits. Exome capture in pigs provides a tool to identify coding region variation associated with production traits, including loss of function mutations which may explain embryonic and neonatal losses, and to improve genomic assemblies in the vicinity of protein coding genes in the pig. Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-550) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Mick Watson
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush, Edinburgh EH25 9RG, UK.
|
16
|
van der Burgt A, Severing E, Collemare J, de Wit PJGM. Automated alignment-based curation of gene models in filamentous fungi. BMC Bioinformatics 2014; 15:19. [PMID: 24433567 PMCID: PMC3898260 DOI: 10.1186/1471-2105-15-19] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 07/15/2013] [Accepted: 01/11/2014] [Indexed: 11/16/2022] Open
Abstract
Background Automated gene-calling is still an error-prone process, particularly for the highly plastic genomes of fungal species. Improvement through quality control and manual curation of gene models is a time-consuming process that requires skilled biologists and is only marginally performed. The wealth of available fungal genomes has not yet been exploited by an automated method that applies quality control of gene models in order to obtain more accurate genome annotations. Results We provide a novel method named alignment-based fungal gene prediction (ABFGP) that is particularly suitable for plastic genomes like those of fungi. It can assess gene models on a gene-by-gene basis making use of informant gene loci. Its performance was benchmarked on 6,965 gene models confirmed by full-length unigenes from ten different fungi. 79.4% of all gene models were correctly predicted by ABFGP. It improves the output of ab initio gene prediction software due to a higher sensitivity and precision for all gene model components. Applicability of the method was shown by revisiting the annotations of six different fungi, using gene loci from up to 29 fungal genomes as informants. Between 7,231 and 8,337 genes were assessed by ABFGP and for each genome between 1,724 and 3,505 gene model revisions were proposed. The reliability of the proposed gene models is assessed by an a posteriori introspection procedure of each intron and exon in the multiple gene model alignment. The total number and type of proposed gene model revisions in the six fungal genomes is correlated to the quality of the genome assembly, and to sequencing strategies used in the sequencing centre, highlighting different types of errors in different annotation pipelines. The ABFGP method is particularly successful in discovering sequence errors and/or disruptive mutations causing truncated and erroneous gene models. 
Conclusions The ABFGP method is an accurate and fully automated quality control method for fungal gene catalogues that can be easily implemented into existing annotation pipelines. With the exponential release of new genomes, the ABFGP method will help decrease the number of gene models that require additional manual curation.
Affiliation(s)
- Pierre J G M de Wit
- Laboratory of Phytopathology, Wageningen University & Research Centre, P.O. Box 16, 6700 AA Wageningen, The Netherlands.
|
17
|
ASPic-GeneID: a lightweight pipeline for gene prediction and alternative isoforms detection. BIOMED RESEARCH INTERNATIONAL 2013; 2013:502827. [PMID: 24308000 PMCID: PMC3838850 DOI: 10.1155/2013/502827] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 06/16/2013] [Revised: 08/01/2013] [Accepted: 08/04/2013] [Indexed: 12/31/2022]
Abstract
New genomes are being sequenced at an increasingly rapid rate, far outpacing the rate at which manual gene annotation can be performed. Automated genome annotation is thus necessitated by this growth in genome projects; however, full-fledged annotation systems are usually home-grown and customized to a particular genome. There is thus a renewed need for accurate ab initio gene prediction methods. However, it is apparent that fully ab initio methods fall short of the required level of sensitivity and specificity for a quality annotation. Evidence in the form of expressed sequences gives the single biggest improvement in accuracy when used to inform gene predictions. Here, we present a lightweight pipeline for first-pass gene prediction on newly sequenced genomes. The two main components are ASPic, a program that derives highly accurate, albeit not necessarily complete, EST-based transcript annotations from EST alignments, and GeneID, a standard gene prediction program, which we have modified to take intron annotations as evidence. The introns output by ASPic CDS predictions are given to GeneID to constrain the exon-chaining process and produce predictions consistent with the underlying EST alignments. The pipeline was successfully tested on the entire C. elegans genome and the 44 ENCODE human pilot regions.
|
18
|
Abstract
The sequencing of the complete genome of the nematode Caenorhabditis elegans was a landmark achievement and ushered in a new era of whole-organism, systems analyses of the biology of this powerful model organism. The success of the C. elegans genome sequencing project also inspired communities working on other organisms to approach genome sequencing of their species. The phylum Nematoda is rich and diverse and of interest to a wide range of research fields from basic biology through ecology and parasitic disease. For all these communities, it is now clear that access to genome scale data will be key to advancing understanding, and in the case of parasites, developing new ways to control or cure diseases. The advent of second-generation sequencing technologies, improvements in computing algorithms and infrastructure, and growth in bioinformatics and genomics literacy are making the addition of genome sequencing to the research goals of any nematode research program a less daunting prospect. To inspire, promote and coordinate genomic sequencing across the diversity of the phylum, we have launched a community wiki and the 959 Nematode Genomes initiative (www.nematodegenomes.org/). Just as the deciphering of the developmental lineage of the 959 cells of the adult hermaphrodite C. elegans was the gateway to broad advances in biomedical science, we hope that a nematode phylogeny with (at least) 959 sequenced species will underpin further advances in understanding the origins of parasitism, the dynamics of genomic change and the adaptations that have made Nematoda one of the most successful animal phyla.
Affiliation(s)
- Sujai Kumar
- Institute of Evolutionary Biology; University of Edinburgh; Edinburgh, UK
|
19
|
Searls DB. A primer in macromolecular linguistics. Biopolymers 2012; 99:203-17. [PMID: 23034580 DOI: 10.1002/bip.22101] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 04/26/2012] [Accepted: 05/25/2012] [Indexed: 01/01/2023]
Abstract
Polymeric macromolecules, when viewed abstractly as strings of symbols, can be treated in terms of formal language theory, providing a mathematical foundation for characterizing such strings both as collections and in terms of their individual structures. In addition this approach offers a framework for analysis of macromolecules by tools and conventions widely used in computational linguistics. This article introduces the ways that linguistics can be and has been applied to molecular biology, covering the relevant formal language theory at a relatively nontechnical level. Analogies between macromolecules and human natural language are used to provide intuitive insights into the relevance of grammars, parsing, and analysis of language complexity to biology.
|
20
|
Pohl M, Theissen G, Schuster S. GC content dependency of open reading frame prediction via stop codon frequencies. Gene 2012; 511:441-6. [PMID: 23000023 DOI: 10.1016/j.gene.2012.09.031] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 12/22/2011] [Revised: 04/27/2012] [Accepted: 09/05/2012] [Indexed: 11/18/2022]
Abstract
A frequently used approach for detecting potential coding regions is to search for stop codons. In the standard genetic code, 3 out of 64 trinucleotides are stop codons. Hence, in random or non-coding DNA one can expect about every 21st trinucleotide to have the same sequence as a stop codon. In contrast, the open reading frames (ORFs) of most protein-coding genes are considerably longer. Thus, the stop codon frequency in coding sequences deviates from the background frequency of the corresponding trinucleotides. This has been utilized for gene prediction, in particular, in detecting protein-coding ORFs. Traditional methods based on stop codon frequency assume that the GC content is about 50%. However, many genomes show significant deviations from that value. With the presented method we can describe the effects of GC content on the selection of appropriate length thresholds of potentially coding ORFs. Conversely, for a given length threshold, we can calculate the probability of observing it in a random sequence. Thus, we can derive the maximum GC content for which ORF length is practicable as a feature for gene prediction methods and the resulting false positive rates. A rough estimate for an upper limit is a GC content of 80%. This estimate can be made more precise by including further parameters and by taking into account start codons as well. We demonstrate the feasibility of this method by applying it to the genomes of the bacteria Rickettsia prowazekii, Escherichia coli and Caulobacter crescentus, exemplifying the effect of GC content variations according to our predictions. We have adapted the method for predicting coding ORFs by stop codon frequency to the case of GC contents different from 50%. Usually, several methods for gene finding need to be combined. Thus, our results concern a specific part within a package of methods. Interestingly, for genomes with low GC content such as that of R. prowazekii, the presented method provides remarkably good results even when applied alone.
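The dependence of stop codon frequency on GC content described in this abstract can be illustrated with a short calculation. The sketch below is not the authors' method: it assumes a simplified background model in which nucleotides are independent, with P(G) = P(C) = GC/2 and P(A) = P(T) = (1 - GC)/2. The function names and the false positive rate `alpha` are hypothetical choices made for illustration.

```python
import math

def stop_codon_probability(gc: float) -> float:
    """Probability that a random trinucleotide is TAA, TAG or TGA.

    Assumes independent bases with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2 (an illustrative simplification).
    """
    at = (1.0 - gc) / 2.0   # probability of A, and of T
    g = gc / 2.0            # probability of G, and of C
    # TAA contributes at^3; TAG and TGA each contribute at^2 * g.
    return at ** 3 + 2.0 * at ** 2 * g

def min_orf_codons(gc: float, alpha: float) -> int:
    """Smallest n such that a stop-free run of n codons occurs with
    probability <= alpha in random sequence of the given GC content."""
    p = stop_codon_probability(gc)
    return math.ceil(math.log(alpha) / math.log(1.0 - p))

if __name__ == "__main__":
    for gc in (0.3, 0.5, 0.7, 0.8):
        p = stop_codon_probability(gc)
        print(f"GC={gc:.1f}  p_stop={p:.4f}  "
              f"mean spacing={1.0 / p:.1f} codons  "
              f"length threshold (alpha=0.01)={min_orf_codons(gc, 0.01)} codons")
```

At a GC content of 50% the model gives a stop probability of 3/64, matching the "every 21st trinucleotide" figure in the abstract. As GC content rises, stop codons become rarer in random sequence, so the ORF length threshold needed for a fixed false positive rate grows.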
Affiliation(s)
- Martin Pohl
- Department of Bioinformatics, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07745 Jena, Germany.
|
21
|
Boerjan B, Cardoen D, Verdonck R, Caers J, Schoofs L. Insect omics research coming of age. (This review is part of a virtual symposium on recent advances in understanding a variety of complex regulatory processes in insect physiology and endocrinology, including development, metabolism, cold hardiness, food intake and digestion, and diuresis, through the use of omics technologies in the postgenomic era.) CAN J ZOOL 2012. [DOI: 10.1139/z2012-010] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 12/11/2022]
Abstract
As more and more insect genomes are fully sequenced and annotated, omics technologies, including transcriptomic, proteomic, peptidomic, and metabolomic profiling, as well as bioinformatics, can be used to exploit this huge amount of sequence information for the study of different biological aspects of insect model organisms. Omics experiments are an elegant way to deliver candidate genes, the function of which can be further explored by genetic tools for functional inactivation or overexpression of the genes of interest. Such tools include mainly RNA interference and are currently being developed in diverse insect species. In this manuscript, we have reviewed how omics technologies were integrated and applied in insect biology.
Affiliation(s)
- Bart Boerjan
- Research Group of Functional Genomics and Proteomics, KU Leuven, Naamsestraat 59, B-3000 Leuven, Belgium
| | - Dries Cardoen
- Research Group of Functional Genomics and Proteomics, KU Leuven, Naamsestraat 59, B-3000 Leuven, Belgium
- Laboratory of Entomology, KU Leuven, Naamsestraat 59, B-3000 Leuven, Belgium
| | - Rik Verdonck
- Research Group of Molecular Developmental Physiology and Signal Transduction, KU Leuven, Naamsestraat 59, B-3000 Leuven, Belgium
| | - Jelle Caers
- Research Group of Functional Genomics and Proteomics, KU Leuven, Naamsestraat 59, B-3000 Leuven, Belgium
| | - Liliane Schoofs
- Research Group of Functional Genomics and Proteomics, KU Leuven, Naamsestraat 59, B-3000 Leuven, Belgium
|
22
|
Gilchrist MJ. From expression cloning to gene modeling: the development of Xenopus gene sequence resources. Genesis 2012; 50:143-54. [PMID: 22344767 PMCID: PMC3488295 DOI: 10.1002/dvg.22008] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 10/28/2011] [Revised: 12/09/2011] [Accepted: 12/21/2011] [Indexed: 11/08/2022]
Abstract
The Xenopus community has made concerted efforts over the last 10–12 years systematically to improve the available sequence information for this amphibian model organism ideally suited to the study of early development in vertebrates. Here I review progress in the collection of both sequence data and physical clone reagents for protein coding genes. I conclude that we have cDNA sequences for around 50% and full-length clones for about 35% of the genes in Xenopus tropicalis, and similar numbers but a smaller proportion for Xenopus laevis. In addition, I demonstrate that the gaps in the current genome assembly create problems for the computational elucidation of gene sequences, and suggest some ways to ameliorate the effects of this. genesis 50:143–154, 2012. © 2012 Wiley Periodicals, Inc.
Affiliation(s)
- Michael J Gilchrist
- Division of Systems Biology, MRC National Institute for Medical Research, The Ridgeway, Mill Hill, London, United Kingdom.
|
23
|
Shepard SS, McSweeny A, Serpen G, Fedorov A. Exploiting mid-range DNA patterns for sequence classification: binary abstraction Markov models. Nucleic Acids Res 2012; 40:4765-73. [PMID: 22344692 PMCID: PMC3367190 DOI: 10.1093/nar/gks154] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/30/2023] Open
Abstract
Messenger RNA sequences possess specific nucleotide patterns distinguishing them from non-coding genomic sequences. In this study, we explore the utilization of modified Markov models to analyze sequences up to 44 bp, far beyond the 8-bp limit of conventional Markov models, for exon/intron discrimination. In order to analyze nucleotide sequences of this length, their information content is first reduced by conversion into shorter binary patterns via the application of numerous abstraction schemes. After the conversion of genomic sequences to binary strings, homogeneous Markov models trained on the binary sequences are used to discriminate between exons and introns. We term this approach the Binary Abstraction Markov Model (BAMM). High-quality abstraction schemes for exon/intron discrimination are selected using optimization algorithms on supercomputers. The best MM classifiers are then combined using support vector machines into a single classifier. With this approach, over 95% classification accuracy is achieved without taking reading frame into account. With further development, the BAMM approach can be applied to sequences lacking the genetic code such as ncRNAs and 5′-untranslated regions.
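The general idea of abstracting nucleotides to bits and then scoring the bit strings with class-specific Markov models can be sketched as follows. This is a minimal illustration, not the published BAMM implementation: the purine/pyrimidine mapping is one hypothetical abstraction scheme (the paper selects schemes by optimization), the model order and add-one smoothing are arbitrary choices, the training sequences are made-up toy strings rather than real exons or introns, and the support vector machine combination step is omitted.

```python
import math
from collections import defaultdict

# Hypothetical abstraction scheme: purines (A, G) -> 1, pyrimidines (C, T) -> 0.
PURINE_SCHEME = {"A": "1", "G": "1", "C": "0", "T": "0"}

def abstract(seq, scheme=PURINE_SCHEME):
    """Convert a nucleotide string into a binary string."""
    return "".join(scheme[base] for base in seq.upper())

def train_markov(binary_seqs, order=3):
    """Estimate P(next bit | previous `order` bits) with add-one smoothing."""
    counts = defaultdict(lambda: [1, 1])  # pseudo-counts for bit 0 and bit 1
    for s in binary_seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][int(s[i])] += 1
    return {ctx: (c[0] / (c[0] + c[1]), c[1] / (c[0] + c[1]))
            for ctx, c in counts.items()}

def log_likelihood(model, s, order=3):
    """Log-probability of a binary string; unseen contexts fall back to 0.5."""
    total = 0.0
    for i in range(order, len(s)):
        p = model.get(s[i - order:i], (0.5, 0.5))[int(s[i])]
        total += math.log(p)
    return total

def classify(seq, exon_model, intron_model, order=3):
    """Label a sequence by the sign of the log-likelihood ratio."""
    b = abstract(seq)
    llr = log_likelihood(exon_model, b, order) - log_likelihood(intron_model, b, order)
    return "exon" if llr > 0 else "intron"

# Toy training data (invented for illustration only).
exon_train = ["AGGAAGGAGAAGGA", "GAAGAGGAAGAGAG"]
intron_train = ["CTTCCTCTTCCTTC", "TCTCCTTCTCCTCT"]
exon_model = train_markov([abstract(s) for s in exon_train])
intron_model = train_markov([abstract(s) for s in intron_train])

if __name__ == "__main__":
    print(classify("GAGGAAGAGG", exon_model, intron_model))
    print(classify("TCCTTCTCCT", exon_model, intron_model))
```

Because each position carries one bit instead of two, a fixed-size context table covers a longer stretch of sequence, which is the motivation the abstract gives for going beyond the 8-bp limit of conventional models.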
Affiliation(s)
- Samuel S Shepard
- Department of Medicine, University of Toledo, Health Science Campus, Toledo, OH 43614, USA
|
24
|
Hamada M, Asai K. A classification of bioinformatics algorithms from the viewpoint of maximizing expected accuracy (MEA). J Comput Biol 2012; 19:532-49. [PMID: 22313125 DOI: 10.1089/cmb.2011.0197] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022] Open
Abstract
Many estimation problems in bioinformatics are formulated as point estimation problems in a high-dimensional discrete space. In general, it is difficult to design reliable estimators for this type of problem, because the number of possible solutions is immense, which leads to an extremely low probability for every solution, even for the one with the highest probability. Therefore, maximum score and maximum likelihood estimators do not work well in this situation although they are widely employed in a number of applications. Maximizing expected accuracy (MEA) estimation, in which accuracy measures of the target problem and the entire distribution of solutions are considered, is a more successful approach. In this review, we provide an extensive discussion of algorithms and software based on MEA. We describe how a number of algorithms used in previous studies can be classified from the viewpoint of MEA. We believe that this review will be useful not only for users wishing to utilize software to solve the estimation problems appearing in this article, but also for developers wishing to design algorithms on the basis of MEA.
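The contrast between picking the single most probable solution and maximizing expected accuracy can be seen on a toy example. The sketch below assumes a made-up posterior over 2-bit solutions and uses the number of correctly predicted positions as the accuracy measure; both the numbers and the function names are illustrative and do not come from the review.

```python
from itertools import product

# Toy posterior over 2-bit solutions (made-up numbers for illustration).
posterior = {(0, 0): 0.4, (1, 1): 0.3, (0, 1): 0.3}

def accuracy(pred, truth):
    """Number of positions at which the prediction matches the truth."""
    return sum(p == t for p, t in zip(pred, truth))

def ml_estimate(posterior):
    """Solution with the highest posterior probability."""
    return max(posterior, key=posterior.get)

def mea_estimate(posterior):
    """Solution maximizing expected accuracy under the whole posterior."""
    n = len(next(iter(posterior)))

    def expected_accuracy(pred):
        return sum(prob * accuracy(pred, truth)
                   for truth, prob in posterior.items())

    return max(product((0, 1), repeat=n), key=expected_accuracy)

if __name__ == "__main__":
    print(ml_estimate(posterior))   # (0, 0)
    print(mea_estimate(posterior))  # (0, 1)
```

Here the most probable solution (0, 0) has an expected accuracy of 1.1 correct positions, while (0, 1) reaches 1.3 because the second bit is 1 in 60% of the posterior mass. Exhaustive enumeration is used only because the toy space is tiny; practical MEA algorithms work from marginal probabilities instead.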
Affiliation(s)
- Michiaki Hamada
- Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan.
|
25
|
Kuraku S, Meyer A. Detection and phylogenetic assessment of conserved synteny derived from whole genome duplications. Methods Mol Biol 2012; 855:385-95. [PMID: 22407717 DOI: 10.1007/978-1-61779-582-4_14] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Indexed: 12/12/2022]
Abstract
Identification of intragenomic conservation of gene compositions in multiple chromosomal segments led to evidence of whole genome duplications (WGDs). The process by which WGDs have been maintained and decayed provides us with clues for understanding how the genome evolves. In this chapter, we summarize current understanding of phylogenetic distribution and evolutionary impact of WGDs, introduce basic procedures to detect conserved synteny, and discuss typical pitfalls, as well as biological insights.
Affiliation(s)
- Shigehiro Kuraku
- Genome Resource and Analysis Unit, RIKEN Center for Developmental Biology, Chuo-ku, Kobe, Japan.
|
26
|
Hatje K, Keller O, Hammesfahr B, Pillmann H, Waack S, Kollmar M. Cross-species protein sequence and gene structure prediction with fine-tuned Webscipio 2.0 and Scipio. BMC Res Notes 2011; 4:265. [PMID: 21798037 PMCID: PMC3162530 DOI: 10.1186/1756-0500-4-265] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.4] [Received: 06/03/2011] [Accepted: 07/28/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Obtaining transcripts of homologs of closely related organisms and retrieving the reconstructed exon-intron patterns of the genes is a very important process during the analysis of the evolution of a protein family and the comparative analysis of the exon-intron structure of a certain gene from different species. Due to the ever-increasing speed of genome sequencing, the gap to genome annotation is growing. Thus, tools for the correct prediction and reconstruction of genes in related organisms become more and more important. The tool Scipio, which can also be used via the graphical interface WebScipio, performs significant hit processing of the output of the Blat program to account for sequencing errors, missing sequence, and fragmented genome assemblies. However, Scipio has so far been limited to high sequence similarity and unable to reconstruct short exons. RESULTS Scipio and WebScipio have fundamentally been extended to better reconstruct very short exons and intron splice sites and to be better suited for cross-species gene structure predictions. The Needleman-Wunsch algorithm has been implemented for the search for short parts of the query sequence that were not recognized by Blat. Those regions might either be short exons, divergent sequence at intron splice sites, or very divergent exons. We have shown the benefit and use of new parameters with several protein examples from completely different protein families in searches against species from several kingdoms of the eukaryotes. The performance of the new Scipio version has been tested in comparison with several similar tools. CONCLUSIONS With the new version of Scipio very short exons, terminal and internal, of even just one amino acid can correctly be reconstructed. Scipio is also able to correctly predict almost all genes in cross-species searches even if the ancestors of the species separated more than 100 Myr ago and if the protein sequence identity is below 80%. For our test cases Scipio outperforms all other software tested. WebScipio has been restructured and provides easy access to the genome assemblies of about 640 eukaryotic species. Scipio and WebScipio are freely accessible at http://www.webscipio.org.
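The abstract above names the Needleman-Wunsch algorithm as the method used to place short query fragments that Blat missed. As a rough illustration of that algorithm only, the following is a minimal global-alignment sketch in Python. The function name `needleman_wunsch` and the unit scores (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration; Scipio's actual scoring scheme and its handling of translated genomic sequence are not described in the abstract and are not reproduced here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not Scipio's parameters.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Because the alignment is global, every residue of a short fragment must be placed somewhere in the target region, which is presumably why the approach suits unmatched stretches of only a few amino acids; that reading is an interpretation of the abstract, not a statement from it.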
Affiliation(s)
- Klas Hatje
- Abteilung NMR basierte Strukturbiologie, Max-Planck-Institut für Biophysikalische Chemie, Am Fassberg 11, D-37077 Göttingen, Germany.
27
Yuryev A. Integrating fragmented software applications into holistic solutions: focus on drug discovery. Expert Opin Drug Discov 2011; 6:383-92. [PMID: 22646016 DOI: 10.1517/17460441.2011.557659] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/05/2022]
Abstract
INTRODUCTION Current advances in software development and global molecular profiling technologies allow the development of holistic software solutions for drug discovery. Such solutions must streamline in silico drug and therapy development by integrating all types of data into one knowledge base and also by enabling continuous analysis workflows uninterrupted by manual restructuring of inputs and outputs from workflow components. They must provide a collaborative environment for data sharing between multiple users and allow importing of all types of experimental data for subsequent analysis. AREAS COVERED The reader is provided with a review of disparate software applications currently used in drug development and a discussion of existing organizational challenges for development of holistic software solutions. The reader is also provided with a proposed conceptual framework for integration of software components, and some details for its implementation are suggested. EXPERT OPINION Holistic solutions can undoubtedly affect the speed, quality and cost of drug development and personalized therapy. However, they must constantly evolve to rapidly adopt new experimental and statistical methods, incorporate advances in software technologies and allow perpetual optimization of their components. Perpetual improvements in data structure, data quality, statistical algorithms and other mathematical approaches for computer modeling can gradually shift financial and cultural emphasis in the pharmaceutical industry away from traditional experimental approaches and towards computational approaches.
Collapse
Affiliation(s)
- Anton Yuryev
- Ariadne Genomics, Inc., 9430 Key West Avenue, Suite 113, Rockville, MD 20850, USA; +1 240 453 6296 ext 116; +1 270 912 6658
28
Abstract
Genome sequences are quickly being generated from a variety of organisms and provide researchers with an abundance of previously inaccessible information and an important source of insight into immune mechanisms. There are a variety of methods to accurately characterize genes from new genome sequences, but immune receptors pose special challenges for these techniques. Immune receptors, particularly those that directly recognize pathogens, often diverge rapidly among species and are commonly found in large, complex multigene families. Because of these characteristics, immune receptors tend to be overlooked or misannotated in large-scale genomic surveys. We describe here a computational strategy to characterize homologs of immune receptors and also to identify putative novel receptors from newly assembled genome sequences. The description of these protocols is aimed at a typical immunologist, and a substantial knowledge of bioinformatics is not expected. The approach is based on using low-stringency sequence searches to identify divergent homologs. For receptors with multiple domains, the intersection of low-stringency searches can be used to identify divergent receptor sequences with high confidence. For multigene families, these predictions can be refined using sequence conservation among gene family paralogs. This strategy has recently been useful in identifying novel expansions in immune receptors in a number of animal genomes and will likely continue to revolutionize our view of animal immunity as new genomes emerge.
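The abstract above describes intersecting low-stringency searches so that a multidomain receptor is kept only when every expected domain is detected. The idea can be sketched in Python as follows, under stated assumptions: the function name `intersect_domain_hits`, the input layout (a dictionary mapping each domain to a list of `(sequence_id, evalue)` pairs), the permissive default threshold of 10.0, and the example identifiers are all hypothetical and do not come from the protocol itself.

```python
def intersect_domain_hits(hits_by_domain, max_evalue=10.0):
    """Return sequence IDs that have a hit for every domain.

    hits_by_domain: {domain_name: [(sequence_id, evalue), ...]}
    max_evalue: permissive cutoff (an assumed value) so that divergent
                homologs are not discarded by any single search.
    """
    candidate_sets = []
    for hits in hits_by_domain.values():
        # Keep every sequence whose hit passes the low-stringency cutoff.
        candidate_sets.append(
            {seq_id for seq_id, evalue in hits if evalue <= max_evalue}
        )
    if not candidate_sets:
        return set()
    # A weak hit to one domain alone is unconvincing; requiring all domains
    # on the same sequence is what raises confidence.
    return set.intersection(*candidate_sets)


if __name__ == "__main__":
    # Hypothetical search results for a two-domain receptor.
    example = {
        "domain_A": [("gene1", 0.5), ("gene2", 3.0), ("gene3", 50.0)],
        "domain_B": [("gene2", 1.2), ("gene3", 0.01), ("gene4", 8.0)],
    }
    print(sorted(intersect_domain_hits(example)))
```

In this example only `gene2` passes both searches; `gene3` is dropped because its `domain_A` hit exceeds the cutoff. Refinement using conservation among paralogs, also mentioned in the abstract, is not covered by this sketch.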
Affiliation(s)
- Katherine M Buckley
- Department of Immunology and Department of Medical Biophysics, University of Toronto and Sunnybrook Research Institute, Toronto, Ontario, Canada
29
Hendrickson RC, Wang C, Hatcher EL, Lefkowitz EJ. Orthopoxvirus genome evolution: the role of gene loss. Viruses 2010; 2:1933-1967. [PMID: 21994715 PMCID: PMC3185746 DOI: 10.3390/v2091933] [Citation(s) in RCA: 153] [Impact Index Per Article: 10.2] [Received: 07/14/2010] [Revised: 08/25/2010] [Accepted: 09/01/2010] [Indexed: 12/26/2022]
Abstract
Poxviruses are highly successful pathogens, known to infect a variety of hosts. The family Poxviridae includes Variola virus, the causative agent of smallpox, which has been eradicated as a public health threat but could potentially reemerge as a bioterrorist threat. The risk scenario includes other animal poxviruses and genetically engineered manipulations of poxviruses. Studies of orthologous gene sets have established the evolutionary relationships of members within the Poxviridae family. It is not clear, however, how variations between family members arose in the past, an important issue in understanding how these viruses may vary and possibly produce future threats. Using a newly developed poxvirus-specific tool, we predicted accurate gene sets for viruses with completely sequenced genomes in the genus Orthopoxvirus. Employing sensitive sequence comparison techniques together with comparison of syntenic gene maps, we established the relationships between all viral gene sets. These techniques allowed us to unambiguously identify the gene loss/gain events that have occurred over the course of orthopoxvirus evolution. It is clear that for all existing Orthopoxvirus species, no individual species has acquired protein-coding genes unique to that species. All genes found in existing species are also present in members of the species Cowpox virus, and cowpox virus strains contain every gene present in any other orthopoxvirus strain. These results support a theory of reductive evolution in which the reduction in size of the core gene set of a putative ancestral virus played a critical role in speciation and confining any newly emerging virus species to a particular environmental (host or tissue) niche.
Affiliation(s)
- Robert Curtis Hendrickson
- Department of Microbiology, University of Alabama at Birmingham, BBRB 276/11, 845 19th St S, Birmingham, AL 35222, USA
- Chunlin Wang
- Stanford Genome Technology Center, Stanford University, 855 California Ave, Palo Alto, CA 94304, USA
- Eneida L. Hatcher
- Department of Microbiology, University of Alabama at Birmingham, BBRB 276/11, 845 19th St S, Birmingham, AL 35222, USA
- Elliot J. Lefkowitz
- Department of Microbiology, University of Alabama at Birmingham, BBRB 276/11, 845 19th St S, Birmingham, AL 35222, USA