1
Chen L, Ma J, Xiang S, Jiang L, Wang Y, Li Z, Liu X, Duan S, Luo Y, Xiao Y. Promotion of rice seedlings growth and enhancement of cadmium immobilization under cadmium stress with two types of organic fertilizer. Environ Pollut 2024; 346:123619. [PMID: 38401632 DOI: 10.1016/j.envpol.2024.123619] [Received: 11/07/2023] [Revised: 02/03/2024] [Accepted: 02/19/2024] [Indexed: 02/26/2024]
Abstract
Cadmium (Cd)-contaminated soil poses a severe threat to crop production and human health, while also resulting in a waste of land resources. In this study, two types of organic fertilizer (ZCK: Low-content available iron; Z2: High-content available iron) were applied to Cd-contaminated soil for rice cultivation, and the effects of the fertilizer on rice growth and Cd passivation were investigated in conjunction with soil microbial analysis. Results showed that Z2 could alter the composition, structure, and diversity of microbial communities, as well as enhance the complexity and stability of the microbial network. Both 2% and 5% Z2 significantly increased the fresh weight and dry weight of rice plants while suppressing Cd absorption. The 2% Z2 exhibited the best Cd passivation effect. Gene predictions suggested that Z2 may promote plant growth by regulating microbial production of organic acids that dissolve phosphorus and potassium. Furthermore, it is suggested that Z2 may facilitate the absorption and immobilization of soil cadmium through the regulation of microbial cadmium efflux and uptake systems, as well as via the secretion of extracellular polysaccharides. In summary, Z2 can promote rice growth, suppress Cd absorption by rice, and passivate soil Cd by regulating soil microbial communities.
Affiliation(s)
- Liang Chen
  - College of Bioscience and Biotechnology, Hunan Agricultural University, China
- Jingjing Ma
  - College of Bioscience and Biotechnology, Hunan Agricultural University, China
- Sha Xiang
  - College of Bioscience and Biotechnology, Hunan Agricultural University, China
- Lihong Jiang
  - College of Resources, Hunan Agricultural University, China
- Ying Wang
  - College of Bioscience and Biotechnology, Hunan Agricultural University, China
- Zhihuan Li
  - College of Bioscience and Biotechnology, Hunan Agricultural University, China
- Xianjing Liu
  - College of Bioscience and Biotechnology, Hunan Agricultural University, China
- Shuyang Duan
  - College of Bioscience and Biotechnology, Hunan Agricultural University, China
- Yuan Luo
  - College of Bioscience and Biotechnology, Hunan Agricultural University, China
- Yunhua Xiao
  - College of Bioscience and Biotechnology, Hunan Agricultural University, China
2
Jiang L, Dai J, Wang L, Chen L, Zeng G, Liu E, Zhou X, Yao H, Xiao Y, Fang J. Effect of nitrogen retention composite additives Ca(H2PO4)2 and MgSO4 on the degradation of lignocellulose, compost maturation, and fungal communities in compost. Environ Sci Pollut Res Int 2024:10.1007/s11356-024-32992-w. [PMID: 38558335 DOI: 10.1007/s11356-024-32992-w] [Received: 12/12/2023] [Accepted: 03/15/2024] [Indexed: 04/04/2024]
Abstract
This study investigated the effects of the nitrogen-retention composite additives Ca(H2PO4)2 and MgSO4 on lignocellulose degradation, maturation, and fungal communities in compost. Three treatments were compared: a control (C, without Ca(H2PO4)2 and MgSO4), 1% Ca(H2PO4)2 + 2% MgSO4 (CaPM1), and 1.5% Ca(H2PO4)2 + 3% MgSO4 (CaPM2). The results showed that Ca(H2PO4)2 and MgSO4 enhanced the degradation of total organic carbon (TOC) and of lignocellulose in compost, with CaPM2 showing the highest TOC and lignocellulose degradation. Changes in the three-dimensional excitation-emission matrix (3D-EEM) fluorescence spectra of dissolved organic matter (DOM) components indicated that adding Ca(H2PO4)2 and MgSO4 promoted the production of humic acids (HAs) and increased the degree of compost decomposition, with CaPM2 showing the highest degree of decomposition. The additives also modified the composition of the fungal community: they increased the relative abundance of Ascomycota, decreased that of unclassified_Fungi and Glomeromycota, and activated the fungal genera Thermomyces and Aspergillus, which can degrade lignin and cellulose during the thermophilic stage of composting. Ca(H2PO4)2 and MgSO4 also increased the abundance of Saprotroph, particularly undefined Saprotroph. In conclusion, adding Ca(H2PO4)2 and MgSO4 during composting activated fungal communities involved in lignocellulose degradation, promoted the degradation of lignocellulose, and enhanced the degree of compost maturation.
Affiliation(s)
- Lihong Jiang
  - College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, China
  - Hunan Engineering Laboratory for Pollution Control and Waste Utilization in Swine Production, Changsha, 410128, China
- Jiapeng Dai
  - College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, China
- Lutong Wang
  - College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, China
- Liang Chen
  - College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, China
- Guangxi Zeng
  - College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, China
- Erlun Liu
  - College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, China
- Xiangdan Zhou
  - College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, China
- Hao Yao
  - Board of Directors Department, Changsha IMADEK Intelligent Technology Company Limited, Changsha, 410137, China
- Yunhua Xiao
  - College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, China
  - Hunan Engineering Laboratory for Pollution Control and Waste Utilization in Swine Production, Changsha, 410128, China
- Jun Fang
  - College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, China
  - Hunan Engineering Laboratory for Pollution Control and Waste Utilization in Swine Production, Changsha, 410128, China
3
Wu LF, Zhu WG, Yu EP, Cao HL, Wang ZF. Draft genome of Brasenia schreberi, a worldwide distributed and endangered aquatic plant. BMC Genom Data 2024; 25:24. [PMID: 38438998 PMCID: PMC10913576 DOI: 10.1186/s12863-024-01212-2] [Received: 02/01/2024] [Accepted: 02/21/2024] [Indexed: 03/06/2024] Open
Abstract
OBJECTIVES Brasenia is a monotypic genus in the family Cabombaceae. Its only species, B. schreberi, is a macrophyte distributed worldwide. Because it requires good water quality, it is endangered in China and other countries due to the deterioration of aquatic habitats. The young leaves and stems of B. schreberi are covered by thick mucilage, which has high medical value. As an allelopathic aquatic plant, it can also be used in the management of aquatic weeds. Here, we present its assembled and annotated genome to help shed light on its medicinal and allelopathic substances and to facilitate its conservation. DATA DESCRIPTION Genomic DNA and RNA extracted from B. schreberi leaf tissues were used for whole-genome and RNA sequencing on a Nanopore and/or MGI sequencer. The assembly was 1,055,148,839 bp in length, with 92 contigs and an N50 of 22,379,495 bp. Repetitive elements accounted for 555,442,205 bp of the assembly. A completeness assessment of the assembly with BUSCO and compleasm indicated 88.4% and 90.9% completeness, respectively, against the Eudicots database and 95.4% and 96.6% completeness against the Embryophyta database. Gene annotation revealed 67,747 genes coding for 73,344 proteins.
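The N50 quoted above is a standard contiguity statistic: the contig length L at which contigs of length L or longer together cover at least half of the assembly. The following is a minimal sketch of that definition, not the authors' pipeline; the function name `n50` and the toy contig lengths are illustrative assumptions and are unrelated to the B. schreberi assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical toy assembly (made-up lengths, total = 20).
toy_contigs = [2, 3, 4, 5, 6]
print(n50(toy_contigs))  # prints 5: the contigs 6 + 5 = 11 cover at least 10
```

Because the statistic depends only on the sorted lengths, it can be computed from a FASTA index or any list of sequence lengths without loading the sequences themselves.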
Affiliation(s)
- Lin-Fang Wu
  - Guangzhou Linfang Ecological Technology Co., Ltd, 510000, Guangzhou, China
- Wei-Guang Zhu
  - Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, 510650, Guangzhou, China
  - Key Laboratory of National Forestry and Grassland Administration on Plant Conservation and Utilization in Southern China, South China Botanical Garden, Chinese Academy of Sciences, 510650, Guangzhou, China
  - Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, 510650, Guangzhou, China
  - South China National Botanical Garden, 510650, Guangzhou, China
- En-Ping Yu
  - Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, 510650, Guangzhou, China
  - Key Laboratory of National Forestry and Grassland Administration on Plant Conservation and Utilization in Southern China, South China Botanical Garden, Chinese Academy of Sciences, 510650, Guangzhou, China
  - Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, 510650, Guangzhou, China
  - South China National Botanical Garden, 510650, Guangzhou, China
  - University of Chinese Academy of Sciences, 100049, Beijing, China
- Hong-Lin Cao
  - Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, 510650, Guangzhou, China
  - Key Laboratory of National Forestry and Grassland Administration on Plant Conservation and Utilization in Southern China, South China Botanical Garden, Chinese Academy of Sciences, 510650, Guangzhou, China
  - Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, 510650, Guangzhou, China
  - South China National Botanical Garden, 510650, Guangzhou, China
- Zheng-Feng Wang
  - Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, 510650, Guangzhou, China
  - Key Laboratory of National Forestry and Grassland Administration on Plant Conservation and Utilization in Southern China, South China Botanical Garden, Chinese Academy of Sciences, 510650, Guangzhou, China
  - Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, 510650, Guangzhou, China
  - South China National Botanical Garden, 510650, Guangzhou, China
4
Southey BR, Romanova EV, Rodriguez-Zas SL, Sweedler JV. Bioinformatics for Prohormone and Neuropeptide Discovery. Methods Mol Biol 2024; 2758:151-178. [PMID: 38549013 PMCID: PMC11045269 DOI: 10.1007/978-1-0716-3646-6_8] [Indexed: 04/02/2024]
Abstract
Neuropeptides and peptide hormones are signaling molecules produced via complex posttranslational modifications of precursor proteins known as prohormones. Neuropeptides activate specific receptors and are associated with the regulation of physiological systems and behaviors. Identifying prohormones, and the neuropeptides derived from them, in genomic assemblies has become essential to support the annotation and use of the rapidly growing number of sequenced genomes. Here we describe a well-validated methodology for identifying the prohormone complement from genomic assemblies that employs widely available public toolsets and databases. The uncovered prohormone sequences can then be screened for putative neuropeptides to enable accurate proteomic discovery and validation.
Affiliation(s)
- Bruce R Southey
  - Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Elena V Romanova
  - Department of Chemistry, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Sandra L Rodriguez-Zas
  - Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Jonathan V Sweedler
  - Department of Chemistry, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
5
Ismail E, Gad W, Hashem M. A hybrid Stacking-SMOTE model for optimizing the prediction of autistic genes. BMC Bioinformatics 2023; 24:379. [PMID: 37803253 PMCID: PMC10559615 DOI: 10.1186/s12859-023-05501-y] [Received: 03/11/2023] [Accepted: 09/27/2023] [Indexed: 10/08/2023] Open
Abstract
PURPOSE Autism spectrum disorder (ASD) is a disease associated with the neurodevelopment of the brain. The autism spectrum can be observed in early childhood, where the symptoms of the disease usually appear in children within the first year of their life. Currently, ASD can only be diagnosed based on the apparent symptoms due to the lack of information on genes related to the disease. Therefore, in this paper, we aim to predict the largest possible number of disease-causing genes to support better diagnosis. METHODS A hybrid stacking ensemble model with the Synthetic Minority Oversampling TEchnique (Stack-SMOTE) is proposed to predict the genes associated with ASD. The proposed model uses the gene ontology database to measure the similarities between genes using a hybrid gene similarity function (HGS). HGS is effective in measuring similarity because it combines the features of information gain-based methods and graph-based methods. The proposed model addresses the imbalance in the ASD dataset using the Synthetic Minority Oversampling Technique (SMOTE), which generates synthetic data rather than duplicating existing data, thereby reducing overfitting. Subsequently, a gradient boosting-based random forest classifier (GBBRF) is introduced as a new combination technique to enhance the prediction of ASD genes. Moreover, the GBBRF classifier is combined with random forest (RF), k-nearest neighbor, support vector machine (SVM), and logistic regression (LR) to form the proposed Stacking-SMOTE model and optimize the prediction of ASD genes. RESULTS The proposed Stacking-SMOTE model is evaluated using the Simons Foundation Autism Research Initiative (SFARI) gene database and a set of candidate ASD genes. The proposed SMOTE-based model outperforms other reported undersampling and oversampling techniques. In turn, GBBRF achieves higher accuracy than the basic classifiers. Moreover, the experimental results show that the proposed Stacking-SMOTE model outperforms the existing ASD prediction models with approximately 95.5% accuracy. CONCLUSION The proposed Stacking-SMOTE model demonstrates that SMOTE is effective in handling the imbalanced autism data. Furthermore, integrating gradient boosting with the random forest classifier (GBBRF) supports building a robust stacking ensemble model (Stacking-SMOTE).
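The oversampling step described above can be illustrated with a small sketch of the core SMOTE idea: each synthetic point is an interpolation between a minority-class sample and one of its nearest minority-class neighbours, rather than a copy of an existing sample. This is a simplified illustration under stated assumptions, not the authors' implementation; the function name `smote_sketch`, the parameters `k` and `seed`, and the two-dimensional toy points are all hypothetical.

```python
import random
from math import dist


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by linear interpolation.

    For each new sample: pick a random minority point, pick one of its
    k nearest minority neighbours, and place the new point at a random
    position on the line segment between the two.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority-class feature vectors (made-up values).
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
new_points = smote_sketch(minority_points, n_new=5)
print(len(new_points))  # prints 5
```

Because every synthetic point lies on a segment between two real minority samples, the new data stay inside the region already occupied by the minority class, which is why the technique is less prone to overfitting than plain duplication.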
Affiliation(s)
- Eman Ismail
  - Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
- Walaa Gad
  - Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
- Mohamed Hashem
  - Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
6
Brůna T, Li H, Guhlin J, Honsel D, Herbold S, Stanke M, Nenasheva N, Ebel M, Gabriel L, Hoff KJ. Galba: genome annotation with miniprot and AUGUSTUS. BMC Bioinformatics 2023; 24:327. [PMID: 37653395 PMCID: PMC10472564 DOI: 10.1186/s12859-023-05449-z] [Received: 04/18/2023] [Accepted: 08/21/2023] [Indexed: 09/02/2023] Open
Abstract
BACKGROUND The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. RESULTS Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments. CONCLUSIONS Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.
Affiliation(s)
- Tomáš Brůna
  - U.S. Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Heng Li
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02215, USA
- Joseph Guhlin
  - Genomics Aotearoa and Laboratory for Evolution and Development, Department of Biochemistry, University of Otago, Dunedin 9016, New Zealand
- Daniel Honsel
  - Institute of Computer Science, University of Göttingen, 37077 Göttingen, Germany
- Steffen Herbold
  - Faculty for Computer Science and Mathematics, University of Passau, 94032 Passau, Germany
- Mario Stanke
  - Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Natalia Nenasheva
  - Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Matthis Ebel
  - Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Lars Gabriel
  - Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Katharina J. Hoff
  - Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
7
Ismail E, Gad W, Hashem M. HEC-ASD: a hybrid ensemble-based classification model for predicting autism spectrum disorder disease genes. BMC Bioinformatics 2022; 23:554. [PMID: 36544099 PMCID: PMC9768984 DOI: 10.1186/s12859-022-05099-7] [Received: 10/10/2022] [Accepted: 12/06/2022] [Indexed: 12/24/2022] Open
Abstract
PURPOSE Autism spectrum disorder (ASD) is the most prevalent disease today. Its causes may be attributed 80% to genetic factors and 20% to environmental factors. In spite of this, the majority of current research is concerned with environmental causes, and only a small proportion with the genetic causes of the disease. Autism is a complex disease, which makes it difficult to identify the genes that cause it. METHODS A hybrid ensemble-based classification (HEC-ASD) model for predicting ASD genes using gradient boosting machines is proposed. The proposed model utilizes gene ontology (GO) to construct a gene functional similarity matrix using the hybrid gene similarity (HGS) method. HGS measures the semantic similarity between genes effectively. It combines a graph-based method, such as the Wang method, with the number of direct child nodes of a gene term in GO. Moreover, an ensemble gradient boosting classifier is adapted to enhance the prediction of genes, forming a robust classification model. RESULTS The proposed model is evaluated using the Simons Foundation Autism Research Initiative (SFARI) gene database. The experimental results are promising as they improve the classification performance for predicting ASD genes. The results are compared with other approaches that used a gene regulatory network (GRN), a protein-protein interaction (PPI) network, or GO. The HEC-ASD model reaches the highest prediction accuracy, 0.88, using ensemble learning classifiers. CONCLUSION The proposed model demonstrates that an ensemble learning technique using gradient boosting is effective in predicting autism spectrum disorder genes. Moreover, the HEC-ASD model utilizes GO rather than a PPI network or GRN.
Affiliation(s)
- Eman Ismail
  - Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
- Walaa Gad
  - Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
- Mohamed Hashem
  - Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
8
Zheng Z, Hu H, Gao S, Zhou H, Luo W, Kage U, Liu C, Jia J. Leaf thickness of barley: genetic dissection, candidate genes prediction and its relationship with yield-related traits. Theor Appl Genet 2022; 135:1843-1854. [PMID: 35348823 DOI: 10.1007/s00122-022-04076-1] [Received: 09/02/2021] [Accepted: 03/07/2022] [Indexed: 06/14/2023]
Abstract
In this first genetic study on assessing leaf thickness directly in cereals, major and environmentally stable QTL were detected in barley and candidate genes underlying a major locus were identified. Leaf thickness (LT) is an important characteristic affecting leaf functions, which have been intensively studied. However, as LT is small in many plant species and is technically difficult to measure, previous studies on this characteristic have often been based on indirect estimations. In the first study to detect QTL controlling LT by measuring the characteristic directly in barley, large and stable loci were detected in both field and glasshouse trials conducted in different cropping seasons by assessing a population of 201 recombinant inbred lines. Four loci (located on chromosome arms 2H, 3H, 5H and 6H, respectively) were consistently detected for flag leaf thickness (FLT) in each of these trials. The one on 6H had the largest effect, with a maximum LOD of 9.8, explaining up to 20.9% of phenotypic variance. FLT not only shows strong interactions with flag leaf width and flag leaf area but also has strong correlations with fertile tiller number, spike row type, kernel number per spike and heading date. Though with reduced efficiency, these loci were also detectable by assessing the second-last leaf of fully grown plants or even the third leaves of seedlings. Taking advantage of the high-quality genome assemblies for both parents of the mapping population used in this study, three candidate genes underlying the 6H QTL were predicted based on orthologous analysis. These results not only broaden our understanding of the genetic basis of LT and its relationship with other traits in cereal crops but also form the basis for cloning and functional analysis of genes regulating LT in barley.
Affiliation(s)
- Zhi Zheng
  - CSIRO Agriculture and Food, 306 Carmody Road, St Lucia, QLD, 4067, Australia
- Haiyan Hu
  - College of Life Science and Technology, Henan Institute of Science and Technology, Xinxiang, 453003, Henan, China
- Shang Gao
  - School of Life Science, Tsinghua University, Beijing, 100084, China
- Hong Zhou
  - CSIRO Agriculture and Food, 306 Carmody Road, St Lucia, QLD, 4067, Australia
  - Triticeae Research Institute, Sichuan Agricultural University, Wenjiang, Chengdu, 611130, China
- Wei Luo
  - CSIRO Agriculture and Food, 306 Carmody Road, St Lucia, QLD, 4067, Australia
  - Triticeae Research Institute, Sichuan Agricultural University, Wenjiang, Chengdu, 611130, China
- Udaykumar Kage
  - CSIRO Agriculture and Food, 306 Carmody Road, St Lucia, QLD, 4067, Australia
- Chunji Liu
  - CSIRO Agriculture and Food, 306 Carmody Road, St Lucia, QLD, 4067, Australia
- Jizeng Jia
  - National Key Facility for Crop Gene Resources and Genetic Improvement, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
9
Hurgobin B. Annotation of Protein-Coding Genes in Plant Genomes. Methods Mol Biol 2022; 2443:309-326. [PMID: 35037214 DOI: 10.1007/978-1-0716-2067-0_17] [Indexed: 06/14/2023]
Abstract
Advances in next-generation sequencing technologies and lower sequencing costs are paving the way to more plant genome sequencing, assembly, and annotation projects. While genome assembly is the first step toward elucidating the genome structure of a species, it is the annotation of the protein-coding genes that provides meaningful information to biologists. However, genome annotation is not a trivial task. Therefore, the aim of this chapter is to provide a detailed view of this important process, including tools and commands that can be used to carry it out.
Affiliation(s)
- Bhavna Hurgobin
  - La Trobe Institute for Agriculture and Food, Department of Animal, Plant and Soil Sciences, School of Life Sciences, AgriBio Building, La Trobe University, Bundoora, VIC, Australia
  - Australian Research Council Research Hub for Medicinal Agriculture, AgriBio Building, La Trobe University, Bundoora, VIC, Australia
10
Manoharan S, Iyyappan OR. A Hybrid Protocol for Finding Novel Gene Targets for Various Diseases Using Microarray Expression Data Analysis and Text Mining. Methods Mol Biol 2022; 2496:41-70. [PMID: 35713858 DOI: 10.1007/978-1-0716-2305-3_3] [Indexed: 06/15/2023]
Abstract
Advances in experimental technology have produced enormous amounts of raw data, giving rise to subsets of biologists who work with genome, proteome, transcriptome, expression, pathway, and other data. This has led to exponential growth in the scientific literature, which is becoming too large for manual curation and annotation to extract the important information. Microarray data are expression data whose analysis yields lists of up- and downregulated genes that are functionally annotated to ascertain their biological meaning. These genes are represented as vocabularies and/or Gene Ontology terms which, when associated with pathway enrichment analysis, need a relational and conceptual understanding with respect to a disease. The chapter deals with a hybrid approach we designed for identifying novel drug-disease targets. Microarray data for muscular dystrophy are explored here as an example, and text mining approaches are used with the aim of identifying promising novel drug targets. Our main objective is to give a basic overview from the perspective of a biologist for whom the text mining approaches of data mining and information retrieval are fairly new concepts. The chapter aims to bridge the gap between biologists and computational text miners and to bring them together for more informative research carried out in a time-efficient manner.
Affiliation(s)
- Sharanya Manoharan
  - Department of Bioinformatics, Stella Maris College (Autonomous), Chennai, Tamilnadu, India
- Oviya Ramalakshmi Iyyappan
  - Department of Sciences, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Chennai, Tamilnadu, India
11
Ye J, Wang S, Yang X, Tang X. Gene prediction of aging-related diseases based on DNN and Mashup. BMC Bioinformatics 2021; 22:597. [PMID: 34920719 DOI: 10.1186/s12859-021-04518-5] [Received: 06/06/2021] [Accepted: 11/30/2021] [Indexed: 11/17/2022] Open
Abstract
Background At present, bioinformatics research on the relationship between aging-related diseases and genes mainly proceeds by building a machine learning multi-label model to classify each gene. Most existing methods for predicting pathogenic genes either rely on specific types of gene features or directly encode multiple features with different dimensions, concatenating them with the same encoder to predict the final results, which limits the applicability of the algorithm in many ways. Possible shortcomings include incomplete coverage of gene features by a single type of biomics data, overfitting of small-dimensional datasets by a single encoder, and underfitting of larger-dimensional datasets. Methods We use known gene-disease association data and gene descriptors, such as gene ontology (GO) terms, protein-protein interaction (PPI) data, PathDIP, and the Kyoto Encyclopedia of Genes and Genomes (KEGG), as input for deep learning to predict the association between genes and diseases. Our innovation is to use the Mashup algorithm to reduce the dimensionality of PPI, GO and other large biological networks, to add new pathway data from the KEGG database, and then to combine a variety of biological information sources through a modular Deep Neural Network (DNN) to predict the genes related to aging diseases. Result and conclusion The results show that our algorithm is more effective than the standard neural network algorithm (the area under the ROC curve increases from 0.8795 to 0.9153), the gradient enhanced tree classifier and the logistic regression classifier. In this paper, we first use a DNN to learn the similar genes associated with the known diseases from the complex multi-dimensional feature space, and then provide evidence that the assumed genes are associated with a certain disease. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04518-5.
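The comparison above is reported as the area under the ROC curve (AUC). As a reminder of what that number measures, the sketch below computes AUC from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. It is an illustrative sketch only; the function name `roc_auc` and the toy labels and scores are assumptions and have no connection to the paper's data or code.

```python
def roc_auc(labels, scores):
    """Compute ROC AUC from its pairwise (rank) definition.

    labels: iterable of 0/1 class labels.
    scores: iterable of predicted scores, higher meaning "more positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical gene-disease predictions (made-up values).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(toy_labels, toy_scores))  # prints 0.75 (3 of 4 pairs ranked correctly)
```

The pairwise form is quadratic in the number of examples, so it is suited to illustration rather than to large datasets, where a sort-based computation is the usual choice.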
12
Abstract
BACKGROUND BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on provided evidence and then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and the statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data while BRAKER2 uses only a database of cross-species proteins. The BRAKER suite has so far not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently and to pick the one output that is likely better. Therefore, one or the other type of extrinsic evidence would remain unexploited. RESULTS We present TSEBRA, a software tool that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. CONCLUSION TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.
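The general idea of choosing among overlapping transcripts by evidence support can be sketched as follows. This is a minimal illustration only: the `Transcript` fields, the `score` function, and the `min_margin` rule are hypothetical and are not TSEBRA's actual scores, rules, or interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Transcript:
    tx_id: str
    source: str             # e.g. "BRAKER1" or "BRAKER2"
    start: int               # 1-based, inclusive
    end: int                 # 1-based, inclusive
    rnaseq_support: float    # hypothetical: fraction of introns backed by RNA-seq
    protein_support: float   # hypothetical: fraction backed by protein alignments


def overlaps(a: Transcript, b: Transcript) -> bool:
    """True if the two transcripts share at least one genomic position."""
    return a.start <= b.end and b.start <= a.end


def score(tx: Transcript) -> float:
    """Hypothetical evidence score: the better of the two support values."""
    return max(tx.rnaseq_support, tx.protein_support)


def select_transcripts(candidates: List[Transcript], min_margin: float = 0.1) -> List[Transcript]:
    """Keep a transcript unless an overlapping, already kept transcript
    beats its score by at least min_margin (an assumed toy rule)."""
    kept: List[Transcript] = []
    for tx in sorted(candidates, key=score, reverse=True):
        beaten = any(overlaps(tx, k) and score(k) - score(tx) >= min_margin for k in kept)
        if not beaten:
            kept.append(tx)
    return sorted(kept, key=lambda t: t.start)


# Toy input: t1 and t2 overlap, t3 stands alone.
toy = [
    Transcript("t1", "BRAKER1", 100, 500, rnaseq_support=0.9, protein_support=0.2),
    Transcript("t2", "BRAKER2", 300, 700, rnaseq_support=0.1, protein_support=0.5),
    Transcript("t3", "BRAKER2", 1000, 1500, rnaseq_support=0.0, protein_support=0.8),
]
selected_ids = [t.tx_id for t in select_transcripts(toy)]
```

In this toy run the weaker of the two overlapping transcripts is dropped and the isolated one is kept, so the combined set draws on both evidence types.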
Affiliation(s)
- Lars Gabriel
- Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, 17489 Greifswald, Germany
- Center for Functional Genomics of Microbes, University of Greifswald, Felix-Hausdorff-Str. 8, 17489 Greifswald, Germany
- Katharina J. Hoff
- Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, 17489 Greifswald, Germany
- Center for Functional Genomics of Microbes, University of Greifswald, Felix-Hausdorff-Str. 8, 17489 Greifswald, Germany
- Tomáš Brůna
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332 USA
- Mark Borodovsky
- Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA
- Mario Stanke
- Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, 17489 Greifswald, Germany
- Center for Functional Genomics of Microbes, University of Greifswald, Felix-Hausdorff-Str. 8, 17489 Greifswald, Germany
13
Yang C, Chowdhury D, Zhang Z, Cheung WK, Lu A, Bian Z, Zhang L. A review of computational tools for generating metagenome-assembled genomes from metagenomic sequencing data. Comput Struct Biotechnol J 2021; 19:6301-6314. [PMID: 34900140 PMCID: PMC8640167 DOI: 10.1016/j.csbj.2021.11.028] [Citation(s) in RCA: 60] [Impact Index Per Article: 20.0] [Received: 08/30/2021] [Revised: 11/17/2021] [Accepted: 11/17/2021] [Indexed: 12/16/2022] Open
Abstract
Metagenomic sequencing provides a culture-independent avenue to investigate complex microbial communities by constructing metagenome-assembled genomes (MAGs). A MAG represents a microbial genome by a group of sequences from genome assembly with similar characteristics. It enables us to identify novel species and understand their potential functions in a dynamic ecosystem. Many computational tools have been developed to construct and annotate MAGs from metagenomic sequencing; however, there is a prominent gap in comprehensively introducing their background and practical performance. In this paper, we have thoroughly investigated the computational tools designed for both upstream and downstream analyses, including metagenome assembly, metagenome binning, gene prediction, functional annotation, taxonomic classification, and profiling. We have categorized the commonly used tools into unique groups based on their functional background and introduced the underlying core algorithms and associated information to provide a comparative outlook. Furthermore, we have emphasized the computational requirements and offered guidance to help users select the most efficient tools. Finally, we have indicated current limitations, potential solutions, and future perspectives for further improving the tools for MAG construction and annotation. We believe that our work provides a consolidated resource for the current stage of MAG studies and sheds light on the future development of more effective MAG analysis tools for metagenomic sequencing.
Key Words
- CNN, convolutional neural network
- DBG, De Bruijn graph
- GTDB, Genome Taxonomy Database
- Gene functional annotation
- Gene prediction
- Genome assembly
- HMM, Hidden Markov Model
- KEGG, Kyoto Encyclopedia of Genes and Genomes
- LCA, lowest common ancestor
- LPA, label propagation algorithm
- MAGs, metagenome-assembled genomes
- Metagenome binning
- Metagenome-assembled genomes
- Metagenomic sequencing
- Microbial abundance profiling
- OLC, overlap-layout consensus
- ONT, Oxford Nanopore Technologies
- ORFs, open reading frames
- PacBio, Pacific Biosciences
- QC, quality control
- SLR, synthetic long reads
- TNFs, tetranucleotide frequencies
- Taxonomic classification
Affiliation(s)
- Chao Yang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Debajyoti Chowdhury
- Computational Medicine Lab, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Institute of Integrated Bioinformedicine and Translational Sciences, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Zhenmiao Zhang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
- William K. Cheung
- Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Aiping Lu
- Computational Medicine Lab, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Institute of Integrated Bioinformedicine and Translational Sciences, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Zhaoxiang Bian
- Institute of Brain and Gut Research, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Chinese Medicine Clinical Study Center, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Lu Zhang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Computational Medicine Lab, Hong Kong Baptist University, Hong Kong Special Administrative Region
14
Kimbrel JA, Jeffrey BM, Ward CS. Prokaryotic Genome Annotation. Methods Mol Biol 2021; 2349:193-214. [PMID: 34718997 DOI: 10.1007/978-1-0716-1585-0_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2023]
Abstract
In the last decade, the high throughput and relatively low cost of short-read sequencing technologies have revolutionized prokaryotic genomics. This has led to an exponential increase in the number of bacterial and archaeal genome sequences available, as well as a corresponding increase in the number of genome assembly and annotation tools developed. Together, these hardware and software technologies have given scientists unprecedented options to study their chosen microbial systems without the need for large teams of bioinformaticists or supercomputing facilities. While these analysis tools largely fall into only a few categories, each may have different requirements, caveats and file formats, and some may be rarely updated or even abandoned. And so, despite the apparent ease of sequencing and analyzing a prokaryotic genome, it is no wonder that budding genomicists may quickly find themselves overwhelmed. Here, we aim to provide the reader with an overview of genome annotation and its most important considerations, as well as an easy-to-follow protocol to get started with annotating a prokaryotic genome.
Affiliation(s)
- Jeffrey A Kimbrel
- Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA, USA.
- Brendan M Jeffrey
- Bioinformatics and Computational Biosciences Branch, Rocky Mountain Laboratories, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Hamilton, MT, USA
- Christopher S Ward
- Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA, USA
- Department of Biological Sciences, Bowling Green State University, Bowling Green, OH, USA
15
Yu J, Guo L, Dou X, Jiang W, Qian B, Liu J, Wang J, Wang C, Xu C. Comprehensive evaluation of protein-coding sORFs prediction based on a random sequence strategy. Front Biosci (Landmark Ed) 2021; 26:272-278. [PMID: 34455759 DOI: 10.52586/4943] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/18/2021] [Revised: 06/08/2021] [Accepted: 07/07/2021] [Indexed: 11/09/2022]
Abstract
Background: Small open reading frames (sORFs) with protein-coding ability present an unprecedented challenge for genome annotation because of their short sequences and low expression levels. In the past decade, only a few prediction methods have been proposed for the discovery of protein-coding sORFs, and the lack of objective and uniform negative datasets has become an important obstacle to sORF prediction. The prediction efficiency of current sORF prediction methods needs to be further evaluated to provide better research strategies for protein-coding sORF discovery. Methods: In this work, nine mainstream existing methods for predicting the protein-coding potential of ORFs are comprehensively evaluated based on a random sequence strategy. Results: The results show that the current methods perform poorly on different sORF datasets. For comparison, a sequence-based prediction algorithm trained on prokaryotic sORFs is proposed, and its better prediction performance indicates that the random sequence strategy can provide feasible ideas for protein-coding sORF prediction. Conclusions: As an important kind of functional genomic element, protein-coding sORFs have shed light on the dark proteome. This evaluation indicates that there is an urgent need to develop specialized prediction tools for protein-coding sORFs in both eukaryotes and prokaryotes. It is expected that the present work may provide novel ideas for future sORF research.
Affiliation(s)
- Jiafeng Yu
- Shandong Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, 253023 Dezhou, Shandong, China
- Li Guo
- Department of Bioinformatics, Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, 210023 Nanjing, Jiangsu, China
- Xianghua Dou
- Shandong Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, 253023 Dezhou, Shandong, China
- Wenwen Jiang
- Department of Bioinformatics, Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, 210023 Nanjing, Jiangsu, China
- Bowen Qian
- Department of Bioinformatics, Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, 210023 Nanjing, Jiangsu, China
- Jian Liu
- Shandong Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, 253023 Dezhou, Shandong, China
- Jun Wang
- Department of Bioinformatics, Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, 210023 Nanjing, Jiangsu, China
- Chunling Wang
- Shandong Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, 253023 Dezhou, Shandong, China
- Congmin Xu
- Shandong Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, 253023 Dezhou, Shandong, China
- Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA
16
Dziurzynski M, Decewicz P, Ciuchcinski K, Gorecki A, Dziewit L. Simple, Reliable, and Time-Efficient Manual Annotation of Bacterial Genomes with MAISEN. Methods Mol Biol 2021; 2242:221-9. [PMID: 33961227 DOI: 10.1007/978-1-0716-1099-2_14] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/18/2023]
Abstract
Over the last 15 years, the costs of DNA sequencing have fallen sharply, effectively shifting the costs of DNA analysis from sequencing to bioinformatic curation and storage. The huge number of available DNA sequences (including genomes and metagenomes) has resulted in the development of various tools for sequence annotation. While much effort has been invested in the development of automatic annotation pipelines, manual curation of their results is still necessary in order to obtain reliable and strictly validated data. Unfortunately, due to its time-consuming nature, manual annotation is now rarely used. In this chapter, a protocol for efficient manual annotation of prokaryotic DNA sequences using a novel bioinformatic tool, MAISEN (http://maisen.ddlemb.com), is presented. MAISEN is a free, web-based tool designed to accelerate manual annotation by providing the user with a simple interface and precomputed alignments for each predicted feature. It was designed to be available to every scientist, regardless of their bioinformatic proficiency.
17
Karimi E, Geslain E, Belcour A, Frioux C, Aïte M, Siegel A, Corre E, Dittami SM. Robustness analysis of metabolic predictions in algal microbial communities based on different annotation pipelines. PeerJ 2021; 9:e11344. [PMID: 33996285 PMCID: PMC8106915 DOI: 10.7717/peerj.11344] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/08/2020] [Accepted: 04/03/2021] [Indexed: 01/29/2023] Open
Abstract
Animals, plants, and algae rely on symbiotic microorganisms for their development and functioning. Genome sequencing and genomic analyses of these microorganisms provide opportunities to construct metabolic networks and to analyze the metabolism of the symbiotic communities they constitute. Genome-scale metabolic network reconstructions rest on information gained from genome annotation. As there are multiple annotation pipelines available, the question arises to what extent differences in annotation pipelines impact outcomes of these analyses. Here, we compare five commonly used pipelines (Prokka, MaGe, IMG, DFAST, RAST) from predicted annotation features (coding sequences, Enzyme Commission numbers, hypothetical proteins) to the metabolic network-based analysis of symbiotic communities (biochemical reactions, producible compounds, and selection of minimal complementary bacterial communities). While Prokka and IMG produced the most extensive networks, RAST and DFAST networks produced the fewest false positives and the most connected networks with the fewest dead-end metabolites. Our results underline differences between the outputs of the tested pipelines at all examined levels, with small differences in the draft metabolic networks resulting in the selection of different microbial consortia to expand the metabolic capabilities of the algal host. However, the consortia generated yielded similar predicted producible compounds and could therefore be considered functionally interchangeable. This contrast between selected communities and community functions depending on the annotation pipeline needs to be taken into consideration when interpreting the results of metabolic complementarity analyses. In the future, experimental validation of bioinformatic predictions will likely be crucial to both evaluate and refine the pipelines and needs to be coupled with increased efforts to expand and improve annotations in reference databases.
Affiliation(s)
- Elham Karimi
- UMR8227, Integrative Biology of Marine Models, Sorbonne Université/CNRS, Station Biologique de Roscoff, Roscoff, France
- Enora Geslain
- UMR8227, Integrative Biology of Marine Models, Sorbonne Université/CNRS, Station Biologique de Roscoff, Roscoff, France
- FR2424, Sorbonne Université/CNRS, Station Biologique de Roscoff, Roscoff, France
- Arnaud Belcour
- Equipe Dyliss, Univ Rennes, Inria, CNRS, IRISA, Rennes, France
- Méziane Aïte
- Equipe Dyliss, Univ Rennes, Inria, CNRS, IRISA, Rennes, France
- Anne Siegel
- Equipe Dyliss, Univ Rennes, Inria, CNRS, IRISA, Rennes, France
- Erwan Corre
- FR2424, Sorbonne Université/CNRS, Station Biologique de Roscoff, Roscoff, France
- Simon M Dittami
- UMR8227, Integrative Biology of Marine Models, Sorbonne Université/CNRS, Station Biologique de Roscoff, Roscoff, France
18
Banerjee S, Bhandary P, Woodhouse M, Sen TZ, Wise RP, Andorf CM. FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences. BMC Bioinformatics 2021; 22:205. [PMID: 33879057 PMCID: PMC8056616 DOI: 10.1186/s12859-021-04120-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 02/05/2021] [Accepted: 04/07/2021] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Gene annotation in eukaryotes is a non-trivial task that requires meticulous analysis of accumulated transcript data. Challenges include transcriptionally active regions of the genome that contain overlapping genes, genes that produce numerous transcripts, transposable elements and numerous diverse sequence repeats. Currently available gene annotation software applications depend on pre-constructed full-length gene sequence assemblies which are not guaranteed to be error-free. The origins of these sequences are often uncertain, making it difficult to identify and rectify errors in them. This hinders the creation of an accurate and holistic representation of the transcriptomic landscape across multiple tissue types and experimental conditions. Therefore, to gauge the extent of diversity in gene structures, a comprehensive analysis of genome-wide expression data is imperative. RESULTS We present FINDER, a fully automated computational tool that optimizes the entire process of annotating genes and transcript structures. Unlike current state-of-the-art pipelines, FINDER automates the RNA-Seq pre-processing step by working directly with raw sequence reads and optimizes gene prediction from BRAKER2 by supplementing these reads with associated proteins. The FINDER pipeline (1) reports transcripts and recognizes genes that are expressed under specific conditions, (2) generates all possible alternatively spliced transcripts from expressed RNA-Seq data, (3) analyzes read coverage patterns to modify existing transcript models and create new ones, and (4) scores genes as high- or low-confidence based on the available evidence across multiple datasets. We demonstrate the ability of FINDER to automatically annotate a diverse pool of genomes from eight species. CONCLUSIONS FINDER takes a completely automated approach to annotate genes directly from raw expression data. It is capable of processing eukaryotic genomes of all sizes and requires no manual supervision, making it ideal for bench researchers with limited experience in handling computational tools.
Affiliation(s)
- Sagnik Banerjee
- Program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA, 50011, USA
- Department of Statistics, Iowa State University, Ames, IA, 50011, USA
- Priyanka Bhandary
- Program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA, 50011, USA
- Department of Genetics, Developmental and Cell Biology, Iowa State University, Ames, IA, 50011, USA
- Margaret Woodhouse
- Corn Insects and Crop Genetics Research Unit, USDA-Agricultural Research Service, Ames, IA, 50011, USA
- Taner Z Sen
- Crop Improvement and Genetics Research Unit, USDA-Agricultural Research Service, Albany, CA, 94710, USA
- Roger P Wise
- Corn Insects and Crop Genetics Research Unit, USDA-Agricultural Research Service, Ames, IA, 50011, USA
- Department of Plant Pathology and Microbiology, Iowa State University, Ames, IA, 50011, USA
- Carson M Andorf
- Corn Insects and Crop Genetics Research Unit, USDA-Agricultural Research Service, Ames, IA, 50011, USA
- Department of Computer Science, Iowa State University, Ames, IA, 50011, USA
19
Yang X, Su Y, Wu J, Wan W, Chen H, Cao X, Wang J, Zhang Z, Wang Y, Ma D, Loake GJ, Jiang J. Parallel analysis of global garlic gene expression and alliin content following leaf wounding. BMC Plant Biol 2021; 21:174. [PMID: 33838642 PMCID: PMC8035738 DOI: 10.1186/s12870-021-02948-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/28/2020] [Accepted: 03/29/2021] [Indexed: 06/12/2023]
Abstract
BACKGROUND Allium sativum (garlic) is an economically important food source and medicinal plant rich in sulfides and other protective substances such as alliin, the precursor of allicin biosynthesis. Cysteine, serine and sulfur are the precursors of alliin biosynthesis. However, little is known about the alliin content under abiotic stress or the mechanism by which it is synthesized. RESULTS The findings revealed that the content of alliin was lowest in the garlic roots and highest in the buds. Furthermore, alliin levels decreased in mature leaves following wounding. Transcriptome data generated over time after wounding further revealed significant up-regulation of genes integral to the biosynthetic pathways of cysteine and serine in mature garlic leaves. CONCLUSIONS The findings suggest that differential expression of cysteine, serine and sulfide-related genes underlies the accumulation of alliin and its precursors in garlic, providing a basis for further analyses of alliin biosynthesis.
Affiliation(s)
- Xuqin Yang
- The Key Laboratory of Biotechnology for Medicinal Plant of Jiangsu Province, School of Life Science, Jiangsu Normal University, Xuzhou, 221116, Jiangsu, China
- Yiren Su
- The Key Laboratory of Biotechnology for Medicinal Plant of Jiangsu Province, School of Life Science, Jiangsu Normal University, Xuzhou, 221116, Jiangsu, China
- Jiaying Wu
- The Key Laboratory of Biotechnology for Medicinal Plant of Jiangsu Province, School of Life Science, Jiangsu Normal University, Xuzhou, 221116, Jiangsu, China
- Wen Wan
- The Key Laboratory of Biotechnology for Medicinal Plant of Jiangsu Province, School of Life Science, Jiangsu Normal University, Xuzhou, 221116, Jiangsu, China
- Huijian Chen
- XuZhou Nuote Chemical co., Ltd., Xuzhou, 221137, Jiangsu, China
- Xiaoying Cao
- The Key Laboratory of Biotechnology for Medicinal Plant of Jiangsu Province, School of Life Science, Jiangsu Normal University, Xuzhou, 221116, Jiangsu, China
- Junjuan Wang
- The Key Laboratory of Biotechnology for Medicinal Plant of Jiangsu Province, School of Life Science, Jiangsu Normal University, Xuzhou, 221116, Jiangsu, China
- Zhong Zhang
- The Key Laboratory of Biotechnology for Medicinal Plant of Jiangsu Province, School of Life Science, Jiangsu Normal University, Xuzhou, 221116, Jiangsu, China
- Youzhi Wang
- XuZhou Nuote Chemical co., Ltd., Xuzhou, 221137, Jiangsu, China
- Deliang Ma
- XuZhou Nuote Chemical co., Ltd., Xuzhou, 221137, Jiangsu, China
- G J Loake
- Institute of Molecular Plant Sciences, School of Biological Sciences, University of Edinburgh, Edinburgh, EH9 3JH, UK
- Jihong Jiang
- The Key Laboratory of Biotechnology for Medicinal Plant of Jiangsu Province, School of Life Science, Jiangsu Normal University, Xuzhou, 221116, Jiangsu, China
20
Silva R, Padovani K, Góes F, Alves R. geneRFinder: gene finding in distinct metagenomic data complexities. BMC Bioinformatics 2021; 22:87. [PMID: 33632132 PMCID: PMC7905635 DOI: 10.1186/s12859-021-03997-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/02/2020] [Accepted: 02/04/2021] [Indexed: 12/01/2022] Open
Abstract
Background: Microbes perform a fundamental economic, social, and environmental role in our society. Metagenomics makes it possible to investigate microbes in their natural environments (the complex communities) and their interactions. The way they act is usually estimated by looking at the functions they play in those environments, and their responsibility is measured by their genes. Advances in next-generation sequencing technology have facilitated metagenomics research; however, they also create a heavy computational burden. Large and complex biological datasets are available as never before. Many gene predictors are available that can aid the gene annotation process, though they do not appropriately handle metagenomic data complexities. There is no standard metagenomic benchmark data for gene prediction. Thus, gene predictors may inflate their results by obfuscating low false discovery rates. Results: We introduce geneRFinder, an ML-based gene predictor able to outperform state-of-the-art gene prediction tools across this benchmark by using only one pre-trained Random Forest model. Average prediction rates of geneRFinder differed in percentage terms by 54% and 64%, respectively, against Prodigal and FragGeneScan while handling high complexity metagenomes. The specificity rate of geneRFinder had the largest distance against FragGeneScan, 79 percentage points, and was 66 percentage points higher than that of Prodigal. According to McNemar's test, all percentage differences between predictor performances are statistically significant for all datasets at the 99% confidence level. Conclusions: We provide geneRFinder, an approach for gene prediction in distinct metagenomic complexities, available at gitlab.com/r.lorenna/generfinder and https://osf.io/w2yd6/. We also provide a novel, comprehensive benchmark dataset for gene prediction, based on The Critical Assessment of Metagenome Interpretation (CAMI) challenge and containing labeled data from gene regions, available at https://sourceforge.net/p/generfinder-benchmark.
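The abstract cites McNemar's test for comparing predictors on the same sequences. The following is a minimal sketch of that test in its uncorrected chi-square form, assuming paired correct/incorrect outcomes for two predictors; the function name `mcnemar` and the discordant counts are made up for illustration and are not the paper's data or code.

```python
import math
from typing import Tuple


def mcnemar(b: int, c: int) -> Tuple[float, float]:
    """McNemar's test without continuity correction.

    b: items predictor A classified correctly and predictor B incorrectly.
    c: items predictor B classified correctly and predictor A incorrectly.
    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    if b < 0 or c < 0:
        raise ValueError("discordant counts must be non-negative")
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Made-up discordant counts for two predictors evaluated on the same ORFs.
stat, p_value = mcnemar(b=30, c=10)
significant_at_99 = p_value < 0.01
```

Only the discordant pairs enter the statistic, which is why the test suits paired comparisons of classifiers on a shared benchmark. For small counts an exact binomial version is usually preferred.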
Affiliation(s)
- Raíssa Silva
- Vale Institute of Technology, Boaventura da Silva, 955, Belém, BR, 66055-090, Brazil
- PPGCC, Federal University of Pará, Augusto Corrêa, 01, Belém, BR, 66075-110, Brazil
- Kleber Padovani
- PPGCC, Federal University of Pará, Augusto Corrêa, 01, Belém, BR, 66075-110, Brazil
- Fabiana Góes
- ICMC, University of São Paulo, Trab. São Carlense, 400, São Carlos, BR, 13566-590, Brazil
- Ronnie Alves
- Vale Institute of Technology, Boaventura da Silva, 955, Belém, BR, 66055-090, Brazil
- PPGCC, Federal University of Pará, Augusto Corrêa, 01, Belém, BR, 66075-110, Brazil
21
Yadav C, Smith M, Ogunremi D, Yack J. Draft genome assembly and annotation of the masked birch caterpillar, Drepana arcuata (Lepidoptera: Drepanoidea). Data Brief 2020; 33:106531. [PMID: 33299908 PMCID: PMC7704289 DOI: 10.1016/j.dib.2020.106531] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/17/2020] [Revised: 11/05/2020] [Accepted: 11/09/2020] [Indexed: 11/12/2022] Open
Abstract
The masked birch caterpillar, Drepana arcuata Walker (Lepidoptera: Drepanidae), and other Drepanoidea (Lepidoptera) species are excellent organisms for investigating the function and evolution of vibratory communication and sociality in caterpillars. We present a de novo assembled draft genome and functional annotation for D. arcuata, using a combination of short and long sequencing reads generated by Illumina HiSeq X and Oxford Nanopore Technologies (ONT) MinION sequencing platforms, respectively. A total of 460,694,612 150 bp paired-end Illumina and 395,890 ONT raw reads were assembled into 11,493 scaffolds spanning a genome size of 270.5 Mb. The resulting D. arcuata genome has a GC content of 38.79%, repeat content of 8.26%, is 86.5% complete based on Benchmarking Universal Single-Copy Orthologs (BUSCO) assessment, and comprises 10,398 predicted protein-coding genes. These data represent the first genomic resources for the lepidopteran superfamily Drepanoidea. Although the order Lepidoptera comprises numerous ecologically and economically important species, assembled genomes and annotations are available for < 1% of the total species. These data can be further utilized for research on Lepidoptera genomics as well as on the function and evolution of vibratory communication and sociality in larval insects.
Affiliation(s)
- Chanchal Yadav
- Department of Biology, Carleton University, Ottawa, Ontario K1S 5B6, Canada
- Myron Smith
- Department of Biology, Carleton University, Ottawa, Ontario K1S 5B6, Canada
- Dele Ogunremi
- Canadian Food Inspection Agency, Ottawa Laboratory Fallowfield, Ontario K2J 4S1, Canada
- Jayne Yack
- Department of Biology, Carleton University, Ottawa, Ontario K1S 5B6, Canada
22
Meyer C, Scalzitti N, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes. BMC Bioinformatics 2020; 21:513. [PMID: 33172385 PMCID: PMC7656754 DOI: 10.1186/s12859-020-03855-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 07/28/2020] [Accepted: 10/30/2020] [Indexed: 11/10/2022] Open
Abstract
Background Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses. Results We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins. Conclusions Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon–intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.
Affiliation(s)
- Corentin Meyer
- Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
| | - Nicolas Scalzitti
- Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
| | - Anne Jeannin-Girardon
- Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
| | - Pierre Collet
- Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
| | - Olivier Poch
- Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
| | - Julie D Thompson
- Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France.
| |
|
23
|
Zhang Z, Liu L, Kucukoglu M, Tian D, Larkin RM, Shi X, Zheng B. Predicting and clustering plant CLE genes with a new method developed specifically for short amino acid sequences. BMC Genomics 2020; 21:709. [PMID: 33045986 PMCID: PMC7552357 DOI: 10.1186/s12864-020-07114-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/11/2020] [Accepted: 09/29/2020] [Indexed: 11/21/2022] Open
Abstract
Background The CLV3/ESR-RELATED (CLE) gene family encodes small secreted peptides (SSPs) and plays vital roles in plant growth and development by promoting cell-to-cell communication. The prediction and classification of CLE genes is challenging because of their low sequence similarity. Results We developed a machine learning-aided method for predicting CLE genes by using a CLE motif-specific residual score matrix and a novel clustering method based on the Euclidean distance of 12 amino acid residues from the CLE motif in a site-weight dependent manner. In total, 2156 CLE candidates—including 627 novel candidates—were predicted from 69 plant species. The results from our CLE motif-based clustering are consistent with previous reports using the entire pre-propeptide. Characterization of CLE candidates provided systematic statistics on protein lengths, signal peptides, relative motif positions, amino acid compositions of different parts of the CLE precursor proteins, and decisive factors of CLE prediction. The approach taken here provides information on the evolution of the CLE gene family and provides evidence that the CLE and IDA/IDL genes share a common ancestor. Conclusions Our new approach is applicable to SSPs or other proteins with short conserved domains and hence, provides a useful tool for gene prediction, classification and evolutionary analysis.
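As a rough illustration of the site-weighted Euclidean distance over 12 motif residues mentioned in the abstract above, the following sketch compares two 12-residue strings. The one-hot residue encoding, the default uniform `weights`, and the function name `weighted_motif_distance` are assumptions made for this example; the paper's own CLE motif-specific score matrix and site weights are not given in the abstract and are not reproduced here.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MOTIF_LENGTH = 12  # number of motif residues stated in the abstract


def one_hot(residue):
    """Encode one residue as a 20-dimensional indicator vector (an assumed stand-in
    for a motif-specific score matrix)."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def weighted_motif_distance(motif_a, motif_b, weights=None):
    """Site-weighted Euclidean distance between two 12-residue motifs.

    weights: one non-negative weight per motif site (hypothetical values;
    defaults to 1.0 for every site).
    """
    if len(motif_a) != MOTIF_LENGTH or len(motif_b) != MOTIF_LENGTH:
        raise ValueError("both motifs must contain exactly 12 residues")
    if weights is None:
        weights = [1.0] * MOTIF_LENGTH
    if len(weights) != MOTIF_LENGTH:
        raise ValueError("one weight per motif site is required")

    total = 0.0
    for weight, res_a, res_b in zip(weights, motif_a.upper(), motif_b.upper()):
        # Squared Euclidean distance between the two encoded residues at this site.
        site_sq = sum((x - y) ** 2 for x, y in zip(one_hot(res_a), one_hot(res_b)))
        total += weight * site_sq
    return math.sqrt(total)
```

With this encoding, identical motifs have distance 0, and each mismatched site contributes twice its weight to the squared distance, so down-weighting a variable site reduces its influence on any clustering built on top of the distance.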
Affiliation(s)
- Zhe Zhang
- Key Laboratory of Horticultural Plant Biology of Ministry of Education, Huazhong Agricultural University, Wuhan, 430070, China.,College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China
| | - Lei Liu
- Key Laboratory of Horticultural Plant Biology of Ministry of Education, Huazhong Agricultural University, Wuhan, 430070, China.,College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China
| | - Melis Kucukoglu
- Institute of Biotechnology, Helsinki Institute of Life Science (HILIFE), University of Helsinki, 00014, Helsinki, Finland.,Viikki Plant Science Centre, University of Helsinki, 00014, Helsinki, Finland
| | - Dongdong Tian
- Key Laboratory of Horticultural Plant Biology of Ministry of Education, Huazhong Agricultural University, Wuhan, 430070, China.,College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China
| | - Robert M Larkin
- Key Laboratory of Horticultural Plant Biology of Ministry of Education, Huazhong Agricultural University, Wuhan, 430070, China.,College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China
| | - Xueping Shi
- Key Laboratory of Horticultural Plant Biology of Ministry of Education, Huazhong Agricultural University, Wuhan, 430070, China.,College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China.
| | - Bo Zheng
- Key Laboratory of Horticultural Plant Biology of Ministry of Education, Huazhong Agricultural University, Wuhan, 430070, China.,College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China.
| |
|
24
|
Goel N, Singh S, Aseri TC. Global sequence features based translation initiation site prediction in human genomic sequences. Heliyon 2020; 6:e04825. [PMID: 32964155 PMCID: PMC7490824 DOI: 10.1016/j.heliyon.2020.e04825] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/20/2019] [Revised: 05/25/2020] [Accepted: 08/26/2020] [Indexed: 11/26/2022] Open
Abstract
Gene prediction has become increasingly important in genome annotation owing to advances in sequencing technology. Genome annotation further helps in determining the structure and function of these genes. Translation initiation site (TIS) prediction in human genomic sequences is one of the fundamental steps in gene prediction, so accurate prediction of TIS in these sequences is highly desirable. Although many computational methods have been developed for this problem, none of them focused on finding these sites in human genomic sequences. In this paper, a new TIS prediction method is proposed that incorporates global sequence-based features. A support vector machine is used to assess the predictive power of these features. The proposed method achieved an accuracy above 90% when tested on genomic as well as cDNA sequences, indicating that it works well for both. The method can be integrated into a gene prediction system in the future.
Affiliation(s)
- Neelam Goel
- Department of Information Technology, University Institute of Engineering and Technology, Sector-25, Panjab University, Chandigarh 160014, India
| | - Shailendra Singh
- Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Sector-12, Chandigarh 160012, India
| | - Trilok Chand Aseri
- Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Sector-12, Chandigarh 160012, India
| |
|
25
|
Iqbal MN, Rasheed MA, Awais M, Chammam W, Kanwal S, Khan SU, Saddick S, Tlili I. BMT: Bioinformatics mini toolbox for comprehensive DNA and protein analysis. Genomics 2020; 112:4561-6. [PMID: 32791200 DOI: 10.1016/j.ygeno.2020.08.010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/08/2020] [Revised: 08/01/2020] [Accepted: 08/07/2020] [Indexed: 01/05/2023]
Abstract
Background Bioinformatics tools are of great significance and are used in different spheres of the life sciences. A wide variety of tools is available for primary analysis of DNA and protein, but most of them are hosted on different platforms and many go unnoticed. Accessing these tools separately to perform individual tasks is uneconomical and inefficient. Objective Our aim is to bring different bioinformatics models onto a single platform to improve scientific research. Hence, our objective is to make a tool for comprehensive DNA and protein analysis. Methods To develop a reliable, straightforward, standalone desktop application, we used state-of-the-art Python packages and libraries. Bioinformatics Mini Toolbox (BMT) is a combination of seven tools: FastqTrimmer, Gene Prediction, DNA Analysis, Translation, Protein Analysis, Pairwise Alignment and Multiple Alignment. Results FastqTrimmer assists in quality assurance of NGS data. Gene Prediction predicts genes by homology in a novel genome on the basis of a reference sequence. DNA Analysis and Protein Analysis calculate physicochemical properties of nucleotide and protein sequences, respectively. Translation translates the DNA sequence in all six reading frames. Pairwise Alignment performs pairwise global and local alignment of DNA and protein sequences on the basis of multiple matrices. Multiple Alignment aligns multiple sequences and generates a phylogenetic tree. Conclusion We developed a tool for comprehensive DNA and protein analysis. The link to download BMT is https://github.com/nasiriqbal012/BMT_SETUP.git.
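The Translation step described in the abstract above can be pictured with a small standard-library sketch of six-frame translation (three frames on each strand). This is not BMT's code: the function names, the frame labels such as `+1` and `-1`, and the use of `X` for codons containing non-ACGT characters are assumptions for illustration, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

For example, `six_frame_translation("ATGGCCTAA")["+1"]` gives `"MA*"`, and the `-1` frame of the same sequence gives `"LGH"`.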
|
26
|
Ren M, Shi J, Jia J, Guo Y, Ni X, Shi T. Genotype-phenotype correlations of Berardinelli-Seip congenital lipodystrophy and novel candidate genes prediction. Orphanet J Rare Dis 2020; 15:108. [PMID: 32349771 PMCID: PMC7191718 DOI: 10.1186/s13023-020-01383-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/24/2019] [Accepted: 04/13/2020] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND Berardinelli-Seip congenital lipodystrophy (BSCL) is a heterogeneous autosomal recessive disorder characterized by an almost total lack of adipose tissue in the body. Mutations in the AGPAT2, BSCL2, CAV1 and PTRF genes define subtypes I-IV of BSCL, respectively, and clinical data indicate that new causative genes remain to be discovered. Here, we retrieved 341 cases from 60 BSCL-related studies worldwide and aimed to explore genotype-phenotype correlations based on mutations of the AGPAT2 and BSCL2 genes in 251 cases. We also inferred new candidate genes for BSCL through protein-protein interaction and phenotype similarity. RESULTS The analysis shows that BSCL type II, with an earlier age of onset of diabetes mellitus and a higher risk of premature death and mental retardation, is a more severe disorder than BSCL type I, but BSCL type I patients are more likely to have bone cysts. In BSCL type I, females are at higher risk of developing diabetes mellitus and acanthosis nigricans than males, while in BSCL type II, males develop diabetes mellitus earlier than females. In addition, some significant correlations among BSCL-related phenotypes were identified. Prediction of new candidate genes through protein-protein interaction and phenotype similarity indicated that the CAV3, EBP, SNAP29, HK1, CHRM3, OBSL1 and DNAJC13 genes could be pathogenic factors for BSCL. In particular, CAV3 and EBP could be high-priority candidate genes contributing to the pathogenesis of BSCL. CONCLUSIONS Our study substantially extends the current knowledge of the phenotypic and genotypic heterogeneity of BSCL and promotes a more comprehensive understanding of its pathogenic mechanisms.
Affiliation(s)
- Meng Ren
- Center for Bioinformatics and Computational Biology, and the Institute of Biomedical Sciences, School of Life Sciences, East China Normal University, Shanghai, China
| | - Jingru Shi
- Center for Bioinformatics and Computational Biology, and the Institute of Biomedical Sciences, School of Life Sciences, East China Normal University, Shanghai, China
| | - Jinmeng Jia
- Center for Bioinformatics and Computational Biology, and the Institute of Biomedical Sciences, School of Life Sciences, East China Normal University, Shanghai, China
| | - Yongli Guo
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, MOE Key Laboratory of Major Diseases in Children, Beijing Children's Hospital, National Center for Children's Health, Beijing Pediatric Research Institute, Capital Medical University, Beijing, China.
- Biobank for Clinical Data and Samples in Pediatrics, Beijing Children's Hospital, National Center for Children's Health, Beijing Pediatric Research Institute, Capital Medical University, Beijing, China.
- Department of Otolaryngology, Head and Neck Surgery, Beijing Children's Hospital, National Center for Children's Health, Capital Medical University, Beijing, China.
| | - Xin Ni
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, MOE Key Laboratory of Major Diseases in Children, Beijing Children's Hospital, National Center for Children's Health, Beijing Pediatric Research Institute, Capital Medical University, Beijing, China.
- Biobank for Clinical Data and Samples in Pediatrics, Beijing Children's Hospital, National Center for Children's Health, Beijing Pediatric Research Institute, Capital Medical University, Beijing, China.
- Department of Otolaryngology, Head and Neck Surgery, Beijing Children's Hospital, National Center for Children's Health, Capital Medical University, Beijing, China.
| | - Tieliu Shi
- Center for Bioinformatics and Computational Biology, and the Institute of Biomedical Sciences, School of Life Sciences, East China Normal University, Shanghai, China.
- National Center for International Research of Biological Targeting Diagnosis and Therapy, Guangxi Key Laboratory of Biological Targeting Diagnosis and Therapy Research, Collaborative Innovation Center for Targeting Tumor Diagnosis and Therapy, Guangxi Medical University, Nanning, 530021, Guangxi, China.
| |
|
27
|
Scalzitti N, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms. BMC Genomics 2020; 21:293. [PMID: 32272892 PMCID: PMC7147072 DOI: 10.1186/s12864-020-6707-9] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 11/29/2019] [Accepted: 03/30/2020] [Indexed: 02/02/2023] Open
Abstract
Background The draft genome assemblies produced by new sequencing technologies present important challenges for automatic gene prediction pipelines, leading to less accurate gene models. New benchmark methods are needed to evaluate the accuracy of gene prediction methods in the face of incomplete genome assemblies, low genome coverage and quality, complex gene structures, or a lack of suitable sequences for evidence-based annotations. Results We describe the construction of a new benchmark, called G3PO (benchmark for Gene and Protein Prediction PrOgrams), designed to represent many of the typical challenges faced by current genome annotation projects. The benchmark is based on a carefully validated and curated set of real eukaryotic genes from 147 phylogenetically disperse organisms, and a number of test sets are defined to evaluate the effects of different features, including genome sequence quality, gene structure complexity, protein length, etc. We used the benchmark to perform an independent comparative analysis of the most widely used ab initio gene prediction programs and identified the main strengths and weaknesses of the programs. More importantly, we highlight a number of features that could be exploited in order to improve the accuracy of current prediction tools. Conclusions The experiments showed that ab initio gene structure prediction is a very challenging task, which should be further investigated. We believe that the baseline results associated with the complex gene test sets in G3PO provide useful guidelines for future studies.
Affiliation(s)
- Nicolas Scalzitti
- Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
| | - Anne Jeannin-Girardon
- Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
| | - Pierre Collet
- Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
| | - Olivier Poch
- Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
| | - Julie D Thompson
- Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France.
| |
|
28
|
Gu X, Ding J, Liu W, Yang X, Yao L, Gao X, Zhang M, Yang S, Wen J. Comparative genomics and association analysis identifies virulence genes of Cercospora sojina in soybean. BMC Genomics 2020; 21:172. [PMID: 32075575 PMCID: PMC7032006 DOI: 10.1186/s12864-020-6581-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 06/20/2019] [Accepted: 02/13/2020] [Indexed: 03/01/2023] Open
Abstract
BACKGROUND Recently, a new strain of Cercospora sojina (Race15) has been identified, which has caused the breakdown of resistance in most soybean cultivars in China. Despite this serious yield reduction, little is known about why this strain is more virulent than others. Therefore, we sequenced the Race15 genome and compared it to the Race1 genome sequence, as its virulence is significantly lower. We then re-sequenced 30 isolates of C. sojina from different regions to identify differential virulence genes using genome-wide association analysis (GWAS). RESULTS The 40.12-Mb Race15 genome encodes 12,607 predicted genes and contains large numbers of gene clusters that have annotations in 11 different common databases. Comparative genomics revealed that although these two genomes had a large number of homologous genes, their genome structures have evolved to introduce 245 specific genes. The five most important candidate virulence genes were located on Contig 3 and Contig 1 and were mainly related to the regulation of metabolic mechanisms and the biosynthesis of bioactive metabolites, thereby putatively affecting fungal self-toxicity and reducing host resistance. Our study provides insight into the genomic basis of C. sojina pathogenicity and its infection mechanism, enabling future studies of this disease. CONCLUSIONS Via GWAS, we identified five candidate genes using three different methods, and these candidate genes are speculated to be related to metabolic mechanisms and the biosynthesis of bioactive metabolites. Meanwhile, Race15-specific genes may be linked with high virulence. The genes highly prevalent in virulent isolates should also be proposed as candidates, even though they were not found in our SNP analysis. Future work should focus on using a larger sample size to confirm and refine candidate gene identifications and should study the functional roles of these candidates, in order to investigate their potential roles in C. sojina pathogenicity.
Affiliation(s)
- Xin Gu
- Department of Plant Protection, College of Agriculture, Northeast Agricultural University, Harbin, China
- Jiamusi Branch of Heilongjiang Academy of Agricultural Sciences, Jiamusi, China
| | - Junjie Ding
- Jiamusi Branch of Heilongjiang Academy of Agricultural Sciences, Jiamusi, China
| | - Wei Liu
- Jiamusi Branch of Heilongjiang Academy of Agricultural Sciences, Jiamusi, China
| | - Xiaohe Yang
- Jiamusi Branch of Heilongjiang Academy of Agricultural Sciences, Jiamusi, China
| | - Liangliang Yao
- Jiamusi Branch of Heilongjiang Academy of Agricultural Sciences, Jiamusi, China
| | - Xuedong Gao
- Jiamusi Branch of Heilongjiang Academy of Agricultural Sciences, Jiamusi, China
| | - Maoming Zhang
- Jiamusi Branch of Heilongjiang Academy of Agricultural Sciences, Jiamusi, China
| | - Shuai Yang
- Potato Research Institute, Heilongjiang Academy of Agricultural Sciences, Harbin, 150086, China
| | - Jingzhi Wen
- Department of Plant Protection, College of Agriculture, Northeast Agricultural University, Harbin, China.
| |
|
29
|
Herndon N, Shelton J, Gerischer L, Ioannidis P, Ninova M, Dönitz J, Waterhouse RM, Liang C, Damm C, Siemanowski J, Kitzmann P, Ulrich J, Dippel S, Oberhofer G, Hu Y, Schwirz J, Schacht M, Lehmann S, Montino A, Posnien N, Gurska D, Horn T, Seibert J, Vargas Jentzsch IM, Panfilio KA, Li J, Wimmer EA, Stappert D, Roth S, Schröder R, Park Y, Schoppmeier M, Chung HR, Klingler M, Kittelmann S, Friedrich M, Chen R, Altincicek B, Vilcinskas A, Zdobnov E, Griffiths-Jones S, Ronshaugen M, Stanke M, Brown SJ, Bucher G. Enhanced genome assembly and a new official gene set for Tribolium castaneum. BMC Genomics 2020; 21:47. [PMID: 31937263 PMCID: PMC6961396 DOI: 10.1186/s12864-019-6394-6] [Citation(s) in RCA: 58] [Impact Index Per Article: 14.5] [Received: 08/30/2019] [Accepted: 12/12/2019] [Indexed: 12/17/2022] Open
Abstract
Background The red flour beetle Tribolium castaneum has emerged as an important model organism for the study of gene function in development and physiology, for ecological and evolutionary genomics, for pest control and a plethora of other topics. RNA interference (RNAi), transgenesis and genome editing are well established and the resources for genome-wide RNAi screening have become available in this model. All these techniques depend on a high-quality genome assembly and precise gene models. However, the first version of the genome assembly was generated by Sanger sequencing, and with only a small set of RNA sequence data, which limited annotation quality. Results Here, we present an improved genome assembly (Tcas5.2) and an enhanced genome annotation resulting in a new official gene set (OGS3) for Tribolium castaneum, which significantly increase the quality of the genomic resources. By adding large-distance jumping library DNA sequencing to join scaffolds and fill small gaps, the gaps in the genome assembly were reduced and the N50 increased to 4753 kbp. The precision of the gene models was enhanced by the use of a large body of RNA-Seq reads of different life history stages and tissue types, leading to the discovery of 1452 novel gene sequences. We also added new features such as alternative splicing, well-defined UTRs and microRNA target predictions. For quality control, 399 gene models were evaluated by manual inspection. The current gene set was submitted to GenBank and accepted as a RefSeq genome by NCBI. Conclusions The new genome assembly (Tcas5.2) and the official gene set (OGS3) provide enhanced genomic resources for genetic work in Tribolium castaneum. The much improved information on transcription start sites supports transgenic and gene editing approaches. Further, novel types of information such as splice variants and microRNA target genes open additional possibilities for analysis.
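The N50 statistic quoted in the abstract above summarizes assembly contiguity: it is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. A minimal sketch of that calculation follows; the function name `n50` and the plain list-of-lengths input are assumptions for illustration and are not taken from the paper's pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length at
    which the running total first reaches half of the summed assembly length.
    """
    lengths = sorted(lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:  # integer-safe test for "at least half"
            return length
```

For scaffold lengths 2 through 10, the total is 54 and the three longest scaffolds (10, 9, 8) sum to 27, so `n50([2, 3, 4, 5, 6, 7, 8, 9, 10])` returns 8.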
Affiliation(s)
- Nicolae Herndon
- Department of Computer Science, East Carolina University, Greenville, NC, 27858, USA
| | - Jennifer Shelton
- Division of Biology, Kansas State University, Manhattan, KS, 66506, USA
| | - Lizzy Gerischer
- Institut für Mathematik und Informatik, Universität Greifswald, Greifswald, Germany
| | - Panos Ioannidis
- Department of Genetic Medicine and Development, University of Geneva Medical School and Swiss Institute of Bioinformatics, 1211, Geneva, Switzerland
| | - Maria Ninova
- Faculty of Biology, Medicine and Health, University of Manchester, Michael Smith Building, Oxford Road, Manchester, M13 9PT, UK
| | - Jürgen Dönitz
- Department of Evolutionary Developmental Genetics, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Robert M Waterhouse
- Department of Ecology and Evolution, University of Lausanne and Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
| | - Chun Liang
- Department of Biology, Miami University, Oxford, OH, 45056, USA
| | - Carsten Damm
- Institut für Informatik, Fakultät für Mathematik und Informatik, Georg-August-Universität Göttingen, Goldschmidtstr. 7, 37077, Göttingen, Germany
| | - Janna Siemanowski
- Department of Evolutionary Developmental Genetics, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Peter Kitzmann
- Department of Evolutionary Developmental Genetics, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Julia Ulrich
- Department of Evolutionary Developmental Genetics, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Stefan Dippel
- Göttinger Graduiertenschule für Neurowissenschaften Biophysik und Molekulare Biowissenschaften, Georg-August-Universität Göttingen, Göttingen, Germany
| | - Georg Oberhofer
- Department of Evolutionary Developmental Genetics, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Yonggang Hu
- Department of Evolutionary Developmental Genetics, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Jonas Schwirz
- Department of Evolutionary Developmental Genetics, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Magdalena Schacht
- Department of Evolutionary Developmental Genetics, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Sabrina Lehmann
- Department of Evolutionary Developmental Genetics, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Alice Montino
- Department of Evolutionary Developmental Genetics, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Nico Posnien
- Department of Developmental Biology, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Daniela Gurska
- Institute for Zoology: Developmental Biology, University of Cologne, Zülpicher Str. 47b, 50674, Cologne, Germany
| | - Thorsten Horn
- Institute for Zoology: Developmental Biology, University of Cologne, Zülpicher Str. 47b, 50674, Cologne, Germany
| | - Jan Seibert
- Institute for Zoology: Developmental Biology, University of Cologne, Zülpicher Str. 47b, 50674, Cologne, Germany
| | - Iris M Vargas Jentzsch
- Institute for Zoology: Developmental Biology, University of Cologne, Zülpicher Str. 47b, 50674, Cologne, Germany
| | - Kristen A Panfilio
- School of Life Sciences, University of Warwick, Gibbet Hill Campus, Coventry, CV4 7AL, UK
| | - Jianwei Li
- Department of Developmental Biology, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Ernst A Wimmer
- Department of Developmental Biology, University of Göttingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Dominik Stappert
- Institute of Zoology: Developmental Biology, University of Cologne, Zülpicher Weg 47b, 50674, Cologne, Germany
| | - Siegfried Roth
- Institute of Zoology: Developmental Biology, University of Cologne, Zülpicher Weg 47b, 50674, Cologne, Germany
| | - Reinhard Schröder
- Institut für Biowissenschaften, Universität Rostock, Albert-Einstein-Str. 3, 18059, Rostock, Germany
| | - Yoonseong Park
- Department of Entomology, Kansas State University, Manhattan, KS, 66506, USA
| | - Michael Schoppmeier
- Department of Biology, Division of Developmental Biology, Friedrich-Alexander-University of Erlangen-Nürnberg, Staudtstr. 5, 91058, Erlangen, Germany
| | - Ho-Ryun Chung
- Department of Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Ihnenstraße 63-73, 14195, Berlin, Germany
| | - Martin Klingler
- Department of Biology, Division of Developmental Biology, Friedrich-Alexander-University of Erlangen-Nürnberg, Staudtstr. 5, 91058, Erlangen, Germany
| | - Sebastian Kittelmann
- Oxford Brookes University, Centre for Functional Genomics, Gipsy Lane, Oxford, OX3 0BP, UK
| | - Markus Friedrich
- Department of Anatomy and Cell Biology, Wayne State University, Detroit, MI, 48202, USA
| | - Rui Chen
- Baylor College of Medicine, Houston, Texas, USA
| | - Boran Altincicek
- Institute of Crop Science and Resource Conservation (INRES-Phytomedicine), Rheinische Friedrich-Wilhelms-University of Bonn, Bonn, Germany
| | - Andreas Vilcinskas
- Institute for Insect Biotechnology, Justus-Liebig University of Giessen, Heinrich-Buff-Ring 26-32, 35392, Giessen, Germany
| | - Evgeny Zdobnov
- Department of Genetic Medicine and Development, University of Geneva Medical School and Swiss Institute of Bioinformatics, 1211, Geneva, Switzerland
| | - Sam Griffiths-Jones
- Faculty of Biology, Medicine and Health, University of Manchester, Michael Smith Building, Oxford Road, Manchester, M13 9PT, UK
| | - Matthew Ronshaugen
- Faculty of Biology, Medicine and Health, University of Manchester, Michael Smith Building, Oxford Road, Manchester, M13 9PT, UK
| | - Mario Stanke
- Institut für Mathematik und Informatik, Universität Greifswald, Greifswald, Germany.
| | - Sue J Brown
- Division of Biology, Kansas State University, Manhattan, KS, 66506, USA.
| | - Gregor Bucher
- Georg-August-Universität Göttingen, Göttingen, Germany.
| |
|
30
|
Abstract
Accurate gene prediction in metagenomics fragments is a computationally challenging task because the reads are short and the data are incomplete and fragmented. Most gene-prediction programs are based on extracting a large number of features and then applying statistical approaches or supervised classification approaches to predict genes. In our study, we introduce a convolutional neural network for metagenomics gene prediction (CNN-MGP) program that predicts genes in metagenomics fragments directly from raw DNA sequences, without the need for manual feature extraction and feature selection stages. CNN-MGP is able to learn the characteristics of coding and non-coding regions and distinguish coding and non-coding open reading frames (ORFs). We train 10 CNN models on 10 mutually exclusive datasets based on pre-defined GC content ranges. We extract ORFs from each fragment; then, the ORFs are encoded numerically and fed into the appropriate CNN model based on the fragment GC content. The output from the CNN is the probability that an ORF encodes a gene. Finally, a greedy algorithm is used to select the final gene list. Overall, CNN-MGP is effective and achieves 91% accuracy on the testing dataset. CNN-MGP shows the ability of deep learning to predict genes in metagenomics fragments, and it achieves an accuracy higher than or comparable to state-of-the-art gene-prediction programs that use pre-defined features.
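The preprocessing steps named in the abstract above (extracting ORFs, encoding them numerically, and routing by fragment GC content) can be sketched as follows. This is a simplified illustration rather than CNN-MGP itself: it only finds complete forward-strand ORFs, the one-hot encoding is an assumed choice of numerical encoding, and the ten equal-width GC bins in `gc_bin` are hypothetical, since the abstract does not list the actual GC ranges. All function names are invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}


def find_orfs(fragment, min_len=6):
    """Return (start, end, sequence) for complete ATG..stop ORFs in the 3 forward frames."""
    fragment = fragment.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(fragment) - 2, 3):
            codon = fragment[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, fragment[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_bin(seq, n_bins=10):
    """Index of the (hypothetical) equal-width GC bin used to pick a model."""
    return min(int(gc_content(seq) * n_bins), n_bins - 1)


def one_hot_encode(seq):
    """Encode each base as a 4-element indicator row; unknown bases become all zeros."""
    return [ONE_HOT.get(base, [0, 0, 0, 0]) for base in seq.upper()]
```

In this sketch, each ORF returned by `find_orfs` would be passed through `one_hot_encode` and scored by the model selected with `gc_bin(fragment)`; the scoring network and the greedy selection of the final gene list are outside the scope of the example.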
Affiliation(s)
- Amani Al-Ajlan
- Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
| | - Achraf El Allali
- Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
| |
|
31
|
Wilbrandt J, Misof B, Panfilio KA, Niehuis O. Repertoire-wide gene structure analyses: a case study comparing automatically predicted and manually annotated gene models. BMC Genomics 2019; 20:753. [PMID: 31623555 PMCID: PMC6798390 DOI: 10.1186/s12864-019-6064-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 04/18/2018] [Accepted: 08/27/2019] [Indexed: 02/06/2023] Open
Abstract
Background The location and modular structure of eukaryotic protein-coding genes in genomic sequences can be automatically predicted by gene annotation algorithms. These predictions are often used for comparative studies on gene structure, gene repertoires, and genome evolution. However, automatic annotation algorithms do not yet correctly identify all genes within a genome, and manual annotation is often necessary to obtain accurate gene models and gene sets. As manual annotation is time-consuming, only a fraction of the gene models in a genome is typically manually annotated, and this fraction often differs between species. To assess the impact of manual annotation efforts on genome-wide analyses of gene structural properties, we compared the structural properties of protein-coding genes in seven diverse insect species sequenced by the i5k initiative. Results Our results show that the subset of genes chosen for manual annotation by a research community (3.5–7% of gene models) may have structural properties (e.g., lengths and exon counts) that are not necessarily representative for a species’ gene set as a whole. Nonetheless, the structural properties of automatically generated gene models are only altered marginally (if at all) through manual annotation. Major correlative trends, for example a negative correlation between genome size and exonic proportion, can be inferred from either the automatically predicted or manually annotated gene models alike. Vice versa, some previously reported trends did not appear in either the automatic or manually annotated gene sets, pointing towards insect-specific gene structural peculiarities. Conclusions In our analysis of gene structural properties, automatically predicted gene models proved to be sufficiently reliable to recover the same gene-repertoire-wide correlative trends that we found when focusing on manually annotated gene models only. 
We acknowledge that analyses on the individual gene level clearly benefit from manual curation. However, as genome sequencing and annotation projects often differ in the extent of their manual annotation and curation efforts, our results indicate that comparative studies analyzing gene structural properties in these genomes can nonetheless be justifiable and informative.
Affiliation(s)
- Jeanne Wilbrandt
- Center for molecular Biodiversity Research, Zoological Research Museum Alexander Koenig (ZFMK), Adenauerallee 160, 53113, Bonn, Germany. Present address: Hoffmann Research Group, Leibniz Institute on Aging - Fritz Lipmann Institute, Beutenbergstraße 11, 07745, Jena, Germany.
| | - Bernhard Misof
- Center for molecular Biodiversity Research, Zoological Research Museum Alexander Koenig (ZFMK), Adenauerallee 160, 53113, Bonn, Germany
| | - Kristen A Panfilio
- School of Life Sciences, University of Warwick, Gibbet Hill Campus, Coventry, CV4 7AL, UK
| | - Oliver Niehuis
- Evolutionary Biology and Ecology, Institute of Biology I (Zoology), Albert Ludwig University, Hauptstr. 1, 79104, Freiburg, Germany
| |
|
32
|
Wolf DC, Cryder Z, Gan J. Soil bacterial community dynamics following surfactant addition and bioaugmentation in pyrene-contaminated soils. Chemosphere 2019; 231:93-102. [PMID: 31128356 DOI: 10.1016/j.chemosphere.2019.05.145] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 03/20/2019] [Revised: 05/15/2019] [Accepted: 05/17/2019] [Indexed: 06/09/2023]
Abstract
Because of their toxic properties, polycyclic aromatic hydrocarbons (PAHs) are designated as priority pollutants. The low solubility and strong sorption of PAHs in soil often limit bioremediation. To increase PAH bioavailability and enhance microbial degradation, surfactants are often added to contaminated soils. However, the effects of surfactants on the PAH degradation capacities of soil microbes are generally neglected. In this study, 16S rRNA gene high-throughput sequencing was used to evaluate changes in the soil microbial community after the application of rhamnolipid biosurfactant or Brij-35 surfactant and Mycobacterium vanbaalenii PYR-1 bioaugmentation over a 50-d mineralization study in two soils contaminated with pyrene at 10 mg kg⁻¹. The introduction of pyrene in both soils resulted in an increase in Firmicutes and a decrease in microbial richness and Shannon diversity index. Amendment of rhamnolipid at 1,400 μg g⁻¹ to the native clay soil resulted in a decrease in Bacillus from 48% to 2%, which was accompanied by an increase in Mycoplana, which accounted for 67% of the total genera relative abundance. Phylogenetic investigation of communities by reconstruction of unobserved states was used to predict the activity of functional genes involved in the PAH degradation KEGG pathway and determined that M. vanbaalenii PYR-1 bioaugmentation resulted in an increased number of functional genes utilized in PAH biodegradation. Results of this study provide a better understanding of the soil microbial dynamics in response to surfactant amendments in addition to bioaugmentation of a PAH-degrading microbe. This knowledge contributes to successful and efficient surfactant-enhanced bioremediation of PAH-contaminated soils.
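The two community metrics mentioned in this abstract, richness and the Shannon diversity index, have standard definitions that can be sketched briefly. The Shannon index is H = -Σ pᵢ ln pᵢ over the relative abundances pᵢ of the observed taxa, and richness is the number of taxa with a nonzero count. The function names and the example counts below are hypothetical and are not data from the study.

```python
import math


def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical read counts per genus (not taken from the paper).
even_community = [25, 25, 25, 25]
dominated_community = [97, 1, 1, 1]

h_even = shannon_index(even_community)            # equals ln(4) for 4 equally abundant taxa
h_dominated = shannon_index(dominated_community)  # lower, because one genus dominates
```

Both example communities have the same richness, yet the dominated one has a lower Shannon index, which is why the two metrics are usually reported together.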
Affiliation(s)
- D C Wolf
- Department of Environmental Sciences, University of California, Riverside, Riverside, CA 92521, USA.
| | - Z Cryder
- Department of Environmental Sciences, University of California, Riverside, Riverside, CA 92521, USA
| | - J Gan
- Department of Environmental Sciences, University of California, Riverside, Riverside, CA 92521, USA
| |
|
33
|
Caballero M, Wegrzyn J. gFACs: Gene Filtering, Analysis, and Conversion to Unify Genome Annotations Across Alignment and Gene Prediction Frameworks. Genomics Proteomics Bioinformatics 2019; 17:305-310. [PMID: 31437583 PMCID: PMC6818179 DOI: 10.1016/j.gpb.2019.04.002] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 10/24/2018] [Revised: 03/21/2019] [Accepted: 04/29/2019] [Indexed: 11/26/2022]
Abstract
Published genomes frequently contain erroneous gene models that represent issues associated with identification of open reading frames, start sites, splice sites, and related structural features. The source of these inconsistencies is often traced back to integration across text file formats designed to describe long read alignments and predicted gene structures. In addition, the majority of gene prediction frameworks do not provide robust downstream filtering to remove problematic gene annotations, nor do they represent these annotations in a format consistent with current file standards. These frameworks also lack consideration for functional attributes, such as the presence or absence of protein domains that can be used for gene model validation. To provide oversight to the increasing number of published genome annotations, we present a software package, the Gene Filtering, Analysis, and Conversion (gFACs), to filter, analyze, and convert predicted gene models and alignments. The software operates across a wide range of alignment, analysis, and gene prediction files with a flexible framework for defining gene models with reliable structural and functional attributes. gFACs supports common downstream applications, including genome browsers, and generates extensive details on the filtering process, including distributions that can be visualized to further assess the proposed gene space. gFACs is freely available and implemented in Perl with support from BioPerl libraries at https://gitlab.com/PlantGenomicsLab/gFACs.
Affiliation(s)
- Madison Caballero
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT 06269, USA.
| | - Jill Wegrzyn
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT 06269, USA.
| |
|
34
|
Schiavinato M, Strasser R, Mach L, Dohm JC, Himmelbauer H. Genome and transcriptome characterization of the glycoengineered Nicotiana benthamiana line ΔXT/FT. BMC Genomics 2019; 20:594. [PMID: 31324144 PMCID: PMC6642603 DOI: 10.1186/s12864-019-5960-2] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 03/07/2019] [Accepted: 07/08/2019] [Indexed: 01/21/2023] Open
Abstract
BACKGROUND The allotetraploid tobacco species Nicotiana benthamiana native to Australia has become a popular host for recombinant protein production. Although its usage grows every year, little is known about this plant's genomic and transcriptomic features. Most N. benthamiana accessions currently used in research lack proper documentation of their breeding history and provenance. One of these, the glycoengineered N. benthamiana line ΔXT/FT, is increasingly used for the production of biopharmaceutical proteins. RESULTS Based on an existing draft assembly of the N. benthamiana genome we predict 50,516 protein-encoding genes (62,216 transcripts) supported by expression data derived from 2.35 billion mRNA-seq reads. Using single-copy core genes we show high completeness of the predicted gene set. We functionally annotate more than two thirds of the gene set through sequence homology to genes from other Nicotiana species. We demonstrate that the expression profiles from leaf tissue of ΔXT/FT and its wild-type progenitor show only minimal differences. We identify the transgene insertion sites in ΔXT/FT and show that one of the transgenes was inserted inside another predicted gene that most likely lost its function upon insertion. Based on publicly available mRNA-seq data, we confirm that the N. benthamiana accessions used by different research institutions most likely derive from a single source. CONCLUSIONS This work provides gene annotation of the N. benthamiana genome, a genomic and transcriptomic characterization of a transgenic N. benthamiana line in comparison to its wild-type progenitor, and sheds light onto the relatedness of N. benthamiana accessions that are used in laboratories around the world.
Affiliation(s)
- Matteo Schiavinato
- Department of Biotechnology, University of Natural Resources and Life Sciences (BOKU), Muthgasse 18, 1190 Vienna, Austria
| | - Richard Strasser
- Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences (BOKU), Muthgasse 18, 1190 Vienna, Austria
| | - Lukas Mach
- Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences (BOKU), Muthgasse 18, 1190 Vienna, Austria
| | - Juliane C. Dohm
- Department of Biotechnology, University of Natural Resources and Life Sciences (BOKU), Muthgasse 18, 1190 Vienna, Austria
| | - Heinz Himmelbauer
- Department of Biotechnology, University of Natural Resources and Life Sciences (BOKU), Muthgasse 18, 1190 Vienna, Austria
| |
|
35
|
Meher PK, Sahu TK, Gahoi S, Satpathy S, Rao AR. Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition. Gene 2019; 705:113-126. [PMID: 31009682 DOI: 10.1016/j.gene.2019.04.047] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 12/20/2018] [Revised: 03/27/2019] [Accepted: 04/17/2019] [Indexed: 02/02/2023]
Abstract
Identification of splice sites is imperative for prediction of gene structure. Machine learning-based approaches (MLAs) have been reported to be more successful than rule-based methods for identifying splice sites. However, nucleotide strings must be transformed into numeric features through sequence encoding before they can be used as input to MLAs. In this study, we evaluated the performance of 8 sequence encoding schemes, i.e., Bayes kernel, density and sparse (DS), distribution of tri-nucleotide and 1st order Markov model (DM), frequency difference distance measure (FDDM), paired-nucleotide frequency difference between true and false sites (FDTF), 1st order Markov model (MM1), combination of 1st and 2nd order Markov models (MM1 + MM2) and 2nd order Markov model (MM2), for predicting donor and acceptor splice sites using 5 supervised learning methods (ANN, Bagging, Boosting, RF and SVM). The encoding schemes and machine learning methods were first evaluated in 4 species, i.e., A. thaliana, C. elegans, D. melanogaster and H. sapiens, and the performances were then validated in another four species, i.e., Ciona intestinalis, Dictyostelium discoideum, Phaeodactylum tricornutum and Trypanosoma brucei. In terms of ROC (receiver operating characteristic) and PR (precision-recall) curves, the FDTF encoding approach achieved the highest accuracy, followed by either MM2 or FDDM. Further, SVM achieved the highest accuracy (in terms of ROC and PR curves), followed by RF, across encoding schemes and species. In terms of prediction accuracy across species, the SVM-FDTF combination outperformed the other combinations of classifiers and encoding schemes. Further, splice site prediction accuracies were higher for species with low intron density. To our knowledge, this is the first comprehensive evaluation of sequence encoding schemes for the prediction of splice sites. We have also developed an R package, EncDNA (https://cran.r-project.org/web/packages/EncDNA/index.html), for encoding splice site motifs with different encoding schemes, which is expected to supplement the existing nucleotide sequence encoding approaches. This study is expected to be useful to computational biologists for predicting different functional elements in genomic DNA.
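To make the idea of sequence encoding concrete, here is a minimal sketch of one way a first-order Markov model can turn aligned splice-site motifs into numeric feature vectors. It is an assumption-laden illustration, not the MM1 scheme as defined in the paper or as implemented in the EncDNA package: the position-specific transition probabilities, the pseudocount, and the function names `fit_mm1` and `encode_mm1` are all hypothetical choices.

```python
from collections import defaultdict

BASES = "ACGT"


def fit_mm1(sites, pseudocount=1.0):
    """Estimate position-specific transition probabilities P(x_i | x_{i-1})
    from equal-length training motifs (assumed estimation scheme)."""
    length = len(sites[0])
    counts = [defaultdict(float) for _ in range(length - 1)]
    for site in sites:
        for i in range(1, length):
            counts[i - 1][(site[i - 1], site[i])] += 1.0
    model = []
    for pos_counts in counts:
        probs = {}
        for prev in BASES:
            row_total = sum(pos_counts[(prev, cur)] for cur in BASES) + 4 * pseudocount
            for cur in BASES:
                probs[(prev, cur)] = (pos_counts[(prev, cur)] + pseudocount) / row_total
        model.append(probs)
    return model


def encode_mm1(seq, model):
    """Encode a motif as the vector of its transition probabilities."""
    return [model[i - 1][(seq[i - 1], seq[i])] for i in range(1, len(seq))]


# Hypothetical donor-like motifs used only to exercise the functions.
training_sites = ["AGGT", "AGGT", "CGGT"]
mm1 = fit_mm1(training_sites)
features = encode_mm1("AGGT", mm1)  # one number per adjacent base pair
```

A motif of length L becomes a vector of L - 1 probabilities, which can then be passed to any of the supervised learners listed in the abstract.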
Affiliation(s)
- Prabina Kumar Meher
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
| | - Tanmaya Kumar Sahu
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
| | - Shachi Gahoi
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
| | - Subhrajit Satpathy
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
| | | |
|
36
|
Kawamoto M, Jouraku A, Toyoda A, Yokoi K, Minakuchi Y, Katsuma S, Fujiyama A, Kiuchi T, Yamamoto K, Shimada T. High-quality genome assembly of the silkworm, Bombyx mori. Insect Biochem Mol Biol 2019; 107:53-62. [PMID: 30802494 DOI: 10.1016/j.ibmb.2019.02.002] [Citation(s) in RCA: 147] [Impact Index Per Article: 29.4] [Received: 12/15/2018] [Revised: 02/13/2019] [Accepted: 02/18/2019] [Indexed: 05/21/2023]
Abstract
In 2008, the genome assembly and gene models for the domestic silkworm, Bombyx mori, were published by a Japanese and Chinese collaboration group. However, the genome assembly contains a non-negligible number of misassembled and gap regions due to the presence of many repetitive sequences within the silkworm genome. The erroneous genome assembly occasionally causes incorrect gene prediction. Here we performed hybrid assembly based on 140× deep sequencing of long (PacBio) and short (Illumina) reads. The remaining gaps in the initial genome assembly were closed using BAC and Fosmid sequences, giving a new total length of 460.3 Mb, with 30 gap regions and an N50 of 16.8 Mb in scaffolds and 12.2 Mb in contigs. More RNA-seq and piRNA-seq reads were mapped on the new genome assembly compared with the previous version, indicating that the new genome assembly covers more transcribed regions, including repetitive elements. We performed gene prediction based on the new genome assembly using available mRNA and protein sequence data. The number of gene models was 16,880 with an N50 of 2154 bp. The new gene models reflected more accurate coding sequences and gene sets than the old ones. The proportion of repetitive elements was also reestimated using the new genome assembly, and was calculated to be 46.8% in the silkworm genome. The new genome assembly and gene models are provided in SilkBase (http://silkbase.ab.a.u-tokyo.ac.jp).
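The N50 statistic quoted for scaffolds, contigs, and gene models has a standard definition: it is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total length. The sketch below shows that calculation; the function name and the example lengths are hypothetical and are not data from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("n50() requires at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical contig lengths: total 20, and 6 + 5 = 11 covers at least half.
example = n50([2, 3, 4, 5, 6])
```

Comparing `running * 2` with `total` avoids fractional arithmetic when the total length is odd.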
Affiliation(s)
- Munetaka Kawamoto
- Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo, 113-8657, Japan
| | - Akiya Jouraku
- Institute of Agrobiological Sciences, National Agriculture and Food Research Organization (NARO), 1-2 Owashi, Tsukuba, Ibaraki, 305-8634, Japan
| | - Atsushi Toyoda
- Comparative Genomics Laboratory, Center for Information Biology, National Institute of Genetics, Mishima, Shizuoka, 411-8540, Japan; Advanced Genomics Center, National Institute of Genetics, Mishima, Shizuoka, 411-8540, Japan
| | - Kakeru Yokoi
- Institute of Agrobiological Sciences, National Agriculture and Food Research Organization (NARO), 1-2 Owashi, Tsukuba, Ibaraki, 305-8634, Japan
| | - Yohei Minakuchi
- Comparative Genomics Laboratory, Center for Information Biology, National Institute of Genetics, Mishima, Shizuoka, 411-8540, Japan
| | - Susumu Katsuma
- Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo, 113-8657, Japan
| | - Asao Fujiyama
- Comparative Genomics Laboratory, Center for Information Biology, National Institute of Genetics, Mishima, Shizuoka, 411-8540, Japan; Advanced Genomics Center, National Institute of Genetics, Mishima, Shizuoka, 411-8540, Japan
| | - Takashi Kiuchi
- Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo, 113-8657, Japan.
| | - Kimiko Yamamoto
- Institute of Agrobiological Sciences, National Agriculture and Food Research Organization (NARO), 1-2 Owashi, Tsukuba, Ibaraki, 305-8634, Japan.
| | - Toru Shimada
- Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo, 113-8657, Japan.
| |
|
37
|
Abstract
BRAKER is a pipeline for highly accurate and fully automated gene prediction in novel eukaryotic genomes. It combines two major tools: GeneMark-ES/ET and AUGUSTUS. GeneMark-ES/ET learns its parameters from a novel genomic sequence in a fully automated fashion; if available, it uses extrinsic evidence for model refinement. From the protein-coding genes predicted by GeneMark-ES/ET, we select a set for training AUGUSTUS, one of the most accurate gene finding tools that, in contrast to GeneMark-ES/ET, integrates extrinsic evidence already into the gene prediction step. The first published version, BRAKER1, integrated genomic footprints of unassembled RNA-Seq reads into the training as well as into the prediction steps. The pipeline has since been extended to the integration of data on mapped cross-species proteins, and to the usage of heterogeneous extrinsic evidence, both RNA-Seq and protein alignments. In this book chapter, we briefly summarize the pipeline methodology and describe how to apply BRAKER in environments characterized by various combinations of external evidence.
Affiliation(s)
- Katharina J Hoff
- University of Greifswald, Institute of Mathematics and Computer Science, Greifswald, Germany.
| | - Alexandre Lomsadze
- Joint Georgia Tech and Emory University Wallace H Coulter Department of Biomedical Engineering, Atlanta, GA, USA
| | - Mark Borodovsky
- Joint Georgia Tech and Emory University Wallace H Coulter Department of Biomedical Engineering, Atlanta, GA, USA.
- School of Computational Science and Engineering, Atlanta, GA, 30332, USA.
- Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia.
| | - Mario Stanke
- University of Greifswald, Institute of Mathematics and Computer Science, Greifswald, Germany
| |
|
38
|
Abstract
Transfer RNAs are the largest, most complex non-coding RNA family, universal to all living organisms. tRNAscan-SE has been the de facto tool for predicting tRNA genes in whole genomes. The newly developed version 2.0 has incorporated advanced methodologies with improved probabilistic search software and a suite of new gene models, enabling better functional classification of predicted genes. This chapter describes the use of the UNIX command-driven and online web versions, illustrating different search modes and options.
Affiliation(s)
- Patricia P Chan
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Todd M Lowe
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA.
| |
|
39
|
Abstract
GeMoMa is a homology-based gene prediction program that predicts gene models in target species based on gene models in evolutionary related reference species. GeMoMa utilizes amino acid sequence conservation, intron position conservation, and RNA-seq data to accurately predict protein-coding transcripts. Furthermore, GeMoMa supports the combination of predictions based on several reference species allowing to transfer high-quality annotation of different reference species to a target species. Here, we present a detailed description of GeMoMa modules and the GeMoMa pipeline and how they can be used on the command line to address particular biological problems.
Affiliation(s)
- Jens Keilwagen
- Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI), Federal Research Centre for Cultivated Plants, Quedlinburg, Germany.
| | - Frank Hartung
- Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI), Federal Research Centre for Cultivated Plants, Quedlinburg, Germany
| | - Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
| |
|
40
|
Abstract
Comparing multiple related genomes can help to improve their structural annotation. The accuracy and consistency of the predicted exon-intron structures of the protein-coding genes can be higher when considering all genomes at once rather than annotating one genome at a time. The comparative gene prediction algorithm of AUGUSTUS performs such a multi-genome annotation. A multiple alignment of genomes is used to exploit evolutionary clues to conservation and negative selection. Further, AUGUSTUS exploits the fact that orthologous genes typically have congruent exon-intron structures. Comparative AUGUSTUS simultaneously predicts the genes in all input genomes. In this chapter we walk the reader through a small example from eight vertebrate species, including the construction of an alignment of the input genomes and how to integrate RNA-Seq evidence from multiple species for gene finding.
Affiliation(s)
- Stefanie Nachtweide
- Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Straße 47, 17487, Greifswald, Germany
| | - Mario Stanke
- Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Straße 47, 17487, Greifswald, Germany.
| |
|
41
|
Abstract
Newly sequenced genomes are being added to the tree of life at an unprecedentedly fast pace. Increasingly, such new genomes are phylogenetically close to previously sequenced and annotated genomes. In other cases, whole clades of closely related species or strains ought to be annotated simultaneously. Often, subsequent studies focus on the differences between the closely related species or strains, even though shared gene structures prevail. We here review methods for comparative structural genome annotation. The reviewed methods include classical approaches such as the alignment of protein sequences or protein profiles against the genome and comparative gene prediction methods that exploit a genome alignment to annotate a target genome. Newer approaches such as the simultaneous annotation of multiple genomes are also reviewed. We discuss how the methods depend on the phylogenetic placement of genomes, give advice on the choice of methods, and examine the consistency between gene structure annotations in an example. Further, we provide practical advice on genome annotation in general.
Affiliation(s)
- Stefanie König
- Institut für Mathematik und Informatik, Ernst Moritz Arndt Universität Greifswald, Greifswald, Germany
| | - Lars Romoth
- Institut für Mathematik und Informatik, Ernst Moritz Arndt Universität Greifswald, Greifswald, Germany
| | - Mario Stanke
- Institut für Mathematik und Informatik, Ernst Moritz Arndt Universität Greifswald, Greifswald, Germany.
| |
|
42
|
Grouzdev DS, Tourova TP, Babich TL, Shevchenko MA, Sokolova DS, Abdullin RR, Poltaraus AB, Toshchakov SV, Nazina TN. Whole-genome sequence data and analysis of type strains 'Pusillimonas nitritireducens' and 'Pusillimonas subterraneus' isolated from nitrate- and radionuclide-contaminated groundwater in Russia. Data Brief 2018; 21:882-887. [PMID: 30426040 PMCID: PMC6222257 DOI: 10.1016/j.dib.2018.10.060] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 06/02/2018] [Revised: 10/11/2018] [Accepted: 10/17/2018] [Indexed: 12/01/2022] Open
Abstract
Two strains, 'Pusillimonas nitritireducens' JR1/69-2-13T and 'Pusillimonas subterraneus' JR1/69-3-13T, of aerobic, motile, Gram-negative, non-spore-forming, organotrophic, psychrotolerant bacteria were isolated from a sample of nitrate- and radionuclide-contaminated groundwater in Russia. Here we describe the draft genomes of these strains. The sequenced and annotated genome of the strain JR1/69-2-13T contained 4.3 Mbp with 4108 protein-coding genes. The genome of the strain JR1/69-3-13T contained 4.5 Mbp with 4260 protein-coding genes. Genome analysis of both strains provides an insight into the genomic basis of their resistance to nitrate, heavy metals and metalloids. The draft genome sequences of strains 'Pusillimonas nitritireducens' JR1/69-2-13T and 'Pusillimonas subterraneus' JR1/69-3-13T are available at DDBJ/EMBL/GenBank under the accession nos. https://www.ncbi.nlm.nih.gov/nuccore/PDNV00000000 and https://www.ncbi.nlm.nih.gov/nuccore/PDNW00000000, respectively.
Affiliation(s)
- Denis S Grouzdev
- Institute of Bioengineering, Research Center of Biotechnology, Russian Academy of Sciences, Moscow, Russian Federation
| | - Tatiyana P Tourova
- Winogradsky Institute of Microbiology, Research Center of Biotechnology, Russian Academy of Sciences, Moscow, Russian Federation
| | - Tamara L Babich
- Winogradsky Institute of Microbiology, Research Center of Biotechnology, Russian Academy of Sciences, Moscow, Russian Federation
| | | | - Diyana S Sokolova
- Winogradsky Institute of Microbiology, Research Center of Biotechnology, Russian Academy of Sciences, Moscow, Russian Federation
| | - Ruslan R Abdullin
- Winogradsky Institute of Microbiology, Research Center of Biotechnology, Russian Academy of Sciences, Moscow, Russian Federation
| | - Andrey B Poltaraus
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russian Federation
| | | | - Tamara N Nazina
- Winogradsky Institute of Microbiology, Research Center of Biotechnology, Russian Academy of Sciences, Moscow, Russian Federation; V.I. Vernadsky Institute of Geochemistry and Analytical Chemistry of Russian Academy of Sciences, Moscow, Russian Federation
| |
|
43
|
Abstract
BACKGROUND Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. Due to the large feature space, current de novo gene prediction algorithms use different combinations of classification algorithms to distinguish between coding and non-coding sequences. RESULTS In this study, we apply a filter method to select relevant features from a large set of known features instead of combining them using linear classifiers or ignoring their individual coding potential. We use minimum redundancy maximum relevance (mRMR) to select the most relevant features. Support vector machines (SVM) are trained using these features, and the classification score is transformed into the posterior probability of the coding class. A greedy algorithm uses the probabilities of overlapping candidate genes to select the final genes. Instead of using one model for all sequences, we train an ensemble of SVM models on mutually exclusive datasets based on GC content and use the appropriate model to classify candidate genes based on their read's GC content. CONCLUSION Our proposed algorithm achieves an improvement over some existing algorithms. mRMR produces promising results in gene prediction. It improves classification performance and feature interpretation. Our research serves as a basis for future studies on feature selection for gene prediction.
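The final step described here, a greedy choice among overlapping candidate genes based on their coding probability, can be sketched as follows. The abstract does not give the exact rule, so this is one plausible reading rather than the authors' algorithm: the `Candidate` type, the `min_prob` cutoff, and the `max_overlap` tolerance are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    start: int   # inclusive start coordinate
    end: int     # exclusive end coordinate
    prob: float  # posterior probability of the coding class


def overlap(a, b):
    """Number of positions shared by two candidates."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def greedy_select(candidates, min_prob=0.5, max_overlap=0):
    """Accept candidates from most to least probable, skipping any that
    overlap an already accepted candidate by more than max_overlap."""
    chosen = []
    for cand in sorted(candidates, key=lambda c: c.prob, reverse=True):
        if cand.prob < min_prob:
            break  # every remaining candidate is below the cutoff
        if all(overlap(cand, kept) <= max_overlap for kept in chosen):
            chosen.append(cand)
    return sorted(chosen, key=lambda c: c.start)


# Hypothetical candidates on one read.
candidates = [
    Candidate(0, 100, 0.9),
    Candidate(50, 150, 0.8),   # overlaps the first, so it is dropped
    Candidate(160, 300, 0.7),
    Candidate(310, 400, 0.3),  # below the probability cutoff
]
final_genes = greedy_select(candidates)
```

Sorting by probability first means a strong candidate is never displaced by a weaker overlapping one, which is the defining property of this kind of greedy selection.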
Affiliation(s)
- Amani Al-Ajlan
- College of Computer and Information Sciences, Computer Science Department, King Saud University, Riyadh, Saudi Arabia
| | - Achraf El Allali
- College of Computer and Information Sciences, Computer Science Department, King Saud University, Riyadh, Saudi Arabia
| |
|
44
|
Abstract
The Mongolian gerbil (Meriones unguiculatus) is a member of the rodent family that displays several features not found in mice or rats, including sensory specializations and social patterns more similar to those in humans. These features have made gerbils a valuable animal for research studies of auditory and visual processing, brain development, learning and memory, and neurological disorders. Here, we report the whole gerbil annotated genome sequence, and identify important similarities and differences to the human and mouse genomes. We further analyze the chromosomal structure of eight genes with high relevance for controlling neural signaling and demonstrate a high degree of homology between these genes in mouse and gerbil. This homology increases the likelihood that individual genes can be rapidly identified in gerbil and used for genetic manipulations. The availability of the gerbil genome provides a foundation for advancing our knowledge towards understanding evolution, behavior and neural function in mammals. ACCESSION NUMBER: The Whole Genome Shotgun sequence data from this project has been deposited at DDBJ/ENA/GenBank under the accession NHTI00000000. The version described in this paper is version NHTI01000000. The fragment reads, and mate pair reads have been deposited in the Sequence Read Archive under BioSample accession SAMN06897401.
Affiliation(s)
- Diego A R Zorio
- Department of Biomedical Sciences, College of Medicine, Florida State University, Tallahassee, FL, USA.
- Dan H Sanes
- Center for Neural Science, New York University, New York, NY, USA
- Nace L Golding
- University of Texas at Austin, Department of Neuroscience, Center for Learning and Memory, Austin, TX, USA
- Edwin W Rubel
- Virginia Merrill Bloedel Hearing Research Center, Department of Otolaryngology-Head and Neck Surgery, University of Washington, Seattle, WA, USA
- Yuan Wang
- Department of Biomedical Sciences, College of Medicine, Florida State University, Tallahassee, FL, USA; Program in Neuroscience, Florida State University, Tallahassee, FL, USA.
|
45
|
Abstract
We now have access to the sequences of tens of millions of proteins. These protein sequences are essential for modern molecular biology and computational biology. The vast majority of protein sequences are derived from gene prediction tools and have no experimental evidence supporting their translation. Despite the increasing accuracy of gene prediction tools, the sequence databases are likely to contain a large number of spurious protein predictions. We have developed the Spurio tool to help identify spurious protein predictions in prokaryotes. Spurio searches the query protein sequence against a prokaryotic nucleotide database using tblastn and identifies homologous sequences. The tblastn matches are used to score the query sequence's likelihood of being a spurious protein prediction using a Gaussian process model. The most informative feature is the appearance of stop codons within the presumed translation of homologous DNA sequences. Benchmarking shows that the Spurio tool is able to distinguish spurious from true proteins. However, transposon proteins are prone to be predicted as spurious because of the frequency of degraded homologs found in the DNA sequence databases. Our initial experiments suggest that less than 1% of the proteins in the UniProtKB sequence database are likely to be spurious and that Spurio is able to identify over 60 times more spurious proteins than the AntiFam resource. The Spurio software and source code are available under an MIT license at the following URL: https://bitbucket.org/bateman-group/spurio.
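The stop-codon signal described in this abstract can be sketched as a simple feature summary. This is an illustration only, not Spurio's code: the function name `stop_codon_features` and the feature names are hypothetical, the input is assumed to be the translated subject segments of tblastn matches with `*` marking a stop codon and `-` marking a gap, and the Gaussian process scoring step is not reproduced.

```python
def stop_codon_features(subject_translations):
    """Summarise stop codons in translated homologous DNA segments.

    subject_translations: list of strings, one per tblastn match, where
    '*' marks an in-frame stop codon and '-' marks an alignment gap
    (an assumed input convention for this sketch).
    """
    hits = len(subject_translations)
    if hits == 0:
        return {"hits": 0, "frac_hits_with_stop": 0.0,
                "stops_per_100_positions": 0.0}

    hits_with_stop = sum(1 for s in subject_translations if "*" in s)
    stops = sum(s.count("*") for s in subject_translations)
    # Count aligned positions excluding gap characters.
    positions = sum(len(s) - s.count("-") for s in subject_translations)

    return {
        "hits": hits,
        "frac_hits_with_stop": hits_with_stop / hits,
        "stops_per_100_positions":
            100.0 * stops / positions if positions else 0.0,
    }


features = stop_codon_features(["MKT*LV", "MKTALV", "MK-ALV*"])
```

A high fraction of homologous matches interrupted by stop codons would point toward a spurious prediction under this reading of the abstract. How Spurio weights such evidence is determined by its trained model, which is not shown here.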
Affiliation(s)
- Wolfram Höps
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
- Matt Jeffryes
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
- Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
|
46
|
Gschloessl B, Dorkeld F, Audiot P, Bretaudeau A, Kerdelhué C, Streiff R. De novo genome and transcriptome resources of the Adzuki bean borer Ostrinia scapulalis (Lepidoptera: Crambidae). Data Brief 2018; 17:781-787. [PMID: 29785409 PMCID: PMC5958680 DOI: 10.1016/j.dib.2018.01.073] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/23/2017] [Revised: 01/23/2018] [Accepted: 01/25/2018] [Indexed: 11/25/2022] Open
Abstract
We present a draft genome assembly with a de novo prediction and automated functional annotation of coding genes, and a reference transcriptome of the Adzuki bean borer, Ostrinia scapulalis, based on RNA sequencing of various tissues and developmental stages. The genome assembly spans 419 Mb, has a GC content of 37.4% and includes 26,120 predicted coding genes. The reference transcriptome holds 33,080 unigenes and contains a high proportion of a set of genes conserved in eukaryotes and arthropods, which was used to assess the quality of the reconstructed transcripts. The new genomic and transcriptomic data presented here significantly enrich the public sequence databases for the Crambidae and Lepidoptera, and represent useful resources for future research on the evolution and adaptation of phytophagous moths. The genome and transcriptome assemblies have been deposited and made accessible via an NCBI BioProject (id PRJNA390510) and the LepidoDB database (http://bipaa.genouest.org/sp/ostrinia_scapulalis/).
Affiliation(s)
- B Gschloessl
- CBGP, INRA, CIRAD, IRD, Montpellier SupAgro, Univ Montpellier, Montpellier, France
- F Dorkeld
- CBGP, INRA, CIRAD, IRD, Montpellier SupAgro, Univ Montpellier, Montpellier, France
- P Audiot
- CBGP, INRA, CIRAD, IRD, Montpellier SupAgro, Univ Montpellier, Montpellier, France
- A Bretaudeau
- INRA, UMR Institut de Génétique, Environnement et Protection des Plantes (IGEPP), BioInformatics Platform for Agroecosystems Arthropods (BIPAA), Campus Beaulieu, Rennes, France; INRIA, IRISA, GenOuest Core Facility, Campus de Beaulieu, Rennes, France
- C Kerdelhué
- CBGP, INRA, CIRAD, IRD, Montpellier SupAgro, Univ Montpellier, Montpellier, France
- R Streiff
- CBGP, INRA, CIRAD, IRD, Montpellier SupAgro, Univ Montpellier, Montpellier, France
|
47
|
Orgeur M, Martens M, Börno ST, Timmermann B, Duprez D, Stricker S. A dual transcript-discovery approach to improve the delimitation of gene features from RNA-seq data in the chicken model. Biol Open 2018; 7:bio.028498. [PMID: 29183907 PMCID: PMC5827264 DOI: 10.1242/bio.028498] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 01/20/2023] Open
Abstract
The sequence of the chicken genome, like several other draft genome sequences, is presently not fully covered. Gaps, contigs assigned with low confidence and uncharacterized chromosomes result in gene fragmentation and imprecise gene annotation. Transcript abundance estimation from RNA sequencing (RNA-seq) data relies on read quality, library complexity and expression normalization. In addition, the quality of the genome sequence used to map sequencing reads, and of the gene annotation that defines gene features, must be taken into account. A partially covered genome sequence causes sequencing reads to be lost at the mapping step, while an inaccurate definition of gene features leads to imprecise read counts at the assignment step. Both effects can significantly bias interpretation of RNA-seq data. Here, we describe a dual transcript-discovery approach combining a genome-guided gene prediction and a de novo transcriptome assembly. This dual approach enabled us to increase the assignment rate of RNA-seq data by nearly 20% compared with using only the chicken reference annotation, thereby contributing to a more accurate estimation of transcript abundance. More generally, this strategy could be applied to any organism with a partial genome sequence and/or lacking a manually curated reference annotation in order to improve the accuracy of gene expression studies.
Affiliation(s)
- Mickael Orgeur
- Freie Universität Berlin, Institut für Chemie und Biochemie, Thielallee 63, 14195 Berlin, Germany; Max Planck Institute for Molecular Genetics, Development and Disease Group, Ihnestrasse 63-73, 14195 Berlin, Germany; Sorbonne Universités, UPMC Univ. Paris 06, CNRS UMR 7622, Inserm U1156, IBPS-Developmental Biology Laboratory, 9 Quai Saint-Bernard, 75252 Paris Cedex 05, France
- Marvin Martens
- Sorbonne Universités, UPMC Univ. Paris 06, CNRS UMR 7622, Inserm U1156, IBPS-Developmental Biology Laboratory, 9 Quai Saint-Bernard, 75252 Paris Cedex 05, France
- Stefan T Börno
- Max Planck Institute for Molecular Genetics, Development and Disease Group, Ihnestrasse 63-73, 14195 Berlin, Germany
- Bernd Timmermann
- Max Planck Institute for Molecular Genetics, Development and Disease Group, Ihnestrasse 63-73, 14195 Berlin, Germany
- Delphine Duprez
- Sorbonne Universités, UPMC Univ. Paris 06, CNRS UMR 7622, Inserm U1156, IBPS-Developmental Biology Laboratory, 9 Quai Saint-Bernard, 75252 Paris Cedex 05, France
- Sigmar Stricker
- Freie Universität Berlin, Institut für Chemie und Biochemie, Thielallee 63, 14195 Berlin, Germany; Max Planck Institute for Molecular Genetics, Development and Disease Group, Ihnestrasse 63-73, 14195 Berlin, Germany
|
48
|
Reid I. Evaluating Programs for Predicting Genes and Transcripts with RNA-Seq Support in Fungal Genomes. Methods Mol Biol 2018; 1775:209-227. [PMID: 29876820 DOI: 10.1007/978-1-4939-7804-5_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
The steps needed to computationally predict genes and transcripts in fungal genomes with support from RNA-Seq data are described in detail for three prediction programs: CodingQuarry, BRAKER1, and Harfang. These programs predicted from 86% to 92% (Harfang) of the genes in a manually curated reference set for Aspergillus niger strain NRRL3. Genes with little or no RNA-Seq read coverage were predicted less successfully than genes with adequate coverage.
Affiliation(s)
- Ian Reid
- Centre for Structural and Functional Genomics, Concordia University, Montreal, QC, Canada.
|
49
|
Abstract
Neuropeptides and peptide hormones are signaling molecules produced via complex post-translational modifications of precursor proteins known as prohormones. Neuropeptides activate specific receptors and are associated with the regulation of physiological systems and behaviors. The identification of prohormones, and of the neuropeptides derived from them, from genomic assemblies has become essential to support the annotation and use of the rapidly growing number of sequenced genomes. Here we describe a methodology for identifying the prohormone complement from genomic assemblies that employs widely available public toolsets and databases. The uncovered prohormone sequences can then be screened for putative neuropeptides to enable accurate proteomic discovery and validation.
Affiliation(s)
- Bruce R Southey
- Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Elena V Romanova
- Department of Chemistry and Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Sandra L Rodriguez-Zas
- Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Jonathan V Sweedler
- Department of Chemistry and Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
|
50
|
Abstract
The term "genome annotation" includes identification of protein-coding and noncoding sequences (e.g., repeats, rDNA, and ncRNA) in genome assemblies and attaching functional information (metadata) to these annotated features. Here, we describe the basic outline of fungal nuclear and mitochondrial genome annotation as performed at the US Department of Energy Joint Genome Institute (JGI).
Affiliation(s)
- Sajeet Haridas
- United States Department of Energy Joint Genome Institute, Walnut Creek, CA, USA
- Asaf Salamov
- United States Department of Energy Joint Genome Institute, Walnut Creek, CA, USA
- Igor V Grigoriev
- United States Department of Energy Joint Genome Institute, Walnut Creek, CA, USA.
|