1
|
Zhang J, Jiang Q, Du Z, Geng Y, Hu Y, Tong Q, Song Y, Zhang HY, Yan X, Feng Z. Knowledge graph-derived feed efficiency analysis via pig gut microbiota. Sci Rep 2024; 14:13939. [PMID: 38886444 PMCID: PMC11182767 DOI: 10.1038/s41598-024-64835-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2023] [Accepted: 06/13/2024] [Indexed: 06/20/2024] Open
Abstract
Feed efficiency (FE), which is essential for pig production, has been reported to be partially explained by gut microbiota. Despite an extensive body of research literature on this topic, studies regarding the regulation of feed efficiency by gut microbiota remain fragmented and mostly confined to disorganized or semi-structured unrestricted texts. Meanwhile, structured databases for microbiota analysis are available, yet they often lack a comprehensive understanding of the associated biological processes. Therefore, we have devised an approach to construct a comprehensive knowledge graph by combining unstructured textual intelligence with structured database information and applied it to investigate the relationship between pig gut microbes and FE. Firstly, we created the pgmReading knowledge base and the domain ontology of pig gut microbiota by annotating, extracting, and integrating semantic information from 157 scientific publications. Secondly, we created the pgmPubtator by utilizing PubTator to expand the semantic information related to microbiota. Thirdly, we created the pgmDatabase by mapping and combining the ADDAGMA, gutMGene, and KEGG databases based on the ontology. These three knowledge bases were integrated to form the Pig Gut Microbial Knowledge Graph (PGMKG). Additionally, we created five biological query cases to validate the performance of PGMKG. These cases not only allow us to identify microbes with the most significant impact on FE but also provide insights into the metabolites produced by these microbes and the associated metabolic pathways. This study introduces PGMKG, mapping key microbes in pig feed efficiency and guiding microbiota-targeted optimization.
Affiliation(s)
- Junmei Zhang
- National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
| | - Qin Jiang
- National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
- Yazhouwan National Laboratory (YNL), Sanya, 572025, China
| | - Zhihong Du
- National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
| | - Yilin Geng
- National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
| | - Yuren Hu
- National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
| | - Qichang Tong
- National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
| | - Yunfeng Song
- National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
| | - Hong-Yu Zhang
- National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
| | - Xianghua Yan
- National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China
| | - Zaiwen Feng
- National Key Laboratory of Agricultural Microbiology, College of Informatics, College of Animal Sciences and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, 430070, China.
|
2
|
Nédellec C, Sauvion C, Bossy R, Borovikova M, Deléger L. TaeC: A manually annotated text dataset for trait and phenotype extraction and entity linking in wheat breeding literature. PLoS One 2024; 19:e0305475. [PMID: 38870159 PMCID: PMC11175518 DOI: 10.1371/journal.pone.0305475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2023] [Accepted: 05/31/2024] [Indexed: 06/15/2024] Open
Abstract
Wheat varieties show a large diversity of traits and phenotypes. Linking them to genetic variability is essential for shorter and more efficient wheat breeding programs. A growing number of plant molecular information networks provide interlinked interoperable data to support the discovery of gene-phenotype interactions. A large body of scientific literature and observational data obtained in-field and under controlled conditions document wheat breeding experiments. The cross-referencing of this complementary information is essential. Text from databases and scientific publications has been identified early on as a relevant source of information. However, the wide variety of terms used to refer to traits and phenotype values makes it difficult to find and cross-reference the textual information, e.g. simple dictionary lookup methods miss relevant terms. Corpora with manually annotated examples are thus needed to evaluate and train textual information extraction methods. While several corpora contain annotations of human and animal phenotypes, no corpus is available for plant traits. This hinders the evaluation of text mining-based crop knowledge graphs (e.g. AgroLD, KnetMiner, WheatIS-FAIDARE) and limits the ability to train machine learning methods and improve the quality of information. The Triticum aestivum trait Corpus is a new gold standard for traits and phenotypes of wheat. It consists of 528 PubMed references that are fully annotated by trait, phenotype, and species. We address the interoperability challenge of crossing sparse assay data and publications by using the Wheat Trait and Phenotype Ontology to normalize trait mentions and the species taxonomy of the National Center for Biotechnology Information to normalize species. The paper describes the construction of the corpus. 
A study of the performance of state-of-the-art language models for both named entity recognition and linking tasks trained on the corpus shows that it is suitable for training and evaluation. This corpus is currently the most comprehensive manually annotated corpus for natural language processing studies on crop phenotype information from the literature.
Affiliation(s)
- Claire Nédellec
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
| | - Clara Sauvion
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
| | - Robert Bossy
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
| | - Mariya Borovikova
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
- TETIS, Univ. Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, Montpellier, France
| | - Louise Deléger
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
|
3
|
Shrestha AMS, Gonzales MEM, Ong PCL, Larmande P, Lee HS, Jeung JU, Kohli A, Chebotarov D, Mauleon RP, Lee JS, McNally KL. RicePilaf: a post-GWAS/QTL dashboard to integrate pangenomic, coexpression, regulatory, epigenomic, ontology, pathway, and text-mining information to provide functional insights into rice QTLs and GWAS loci. Gigascience 2024; 13:giae013. [PMID: 38832465 PMCID: PMC11148593 DOI: 10.1093/gigascience/giae013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2023] [Revised: 02/21/2024] [Accepted: 03/12/2024] [Indexed: 06/05/2024] Open
Abstract
BACKGROUND As the number of genome-wide association study (GWAS) and quantitative trait locus (QTL) mappings in rice continues to grow, so does the already long list of genomic loci associated with important agronomic traits. Typically, loci implicated by GWAS/QTL analysis contain tens to hundreds to thousands of single-nucleotide polymorphisms (SNPs)/genes, not all of which are causal and many of which are in noncoding regions. Unraveling the biological mechanisms that tie the GWAS regions and QTLs to the trait of interest is challenging, especially since it requires collating functional genomics information about the loci from multiple, disparate data sources. RESULTS We present RicePilaf, a web app for post-GWAS/QTL analysis, that performs a slew of novel bioinformatics analyses to cross-reference GWAS results and QTL mappings with a host of publicly available rice databases. In particular, it integrates (i) pangenomic information from high-quality genome builds of multiple rice varieties, (ii) coexpression information from genome-scale coexpression networks, (iii) ontology and pathway information, (iv) regulatory information from rice transcription factor databases, (v) epigenomic information from multiple high-throughput epigenetic experiments, and (vi) text-mining information extracted from scientific abstracts linking genes and traits. We demonstrate the utility of RicePilaf by applying it to analyze GWAS peaks of preharvest sprouting and genes underlying yield-under-drought QTLs. CONCLUSIONS RicePilaf enables rice scientists and breeders to shed functional light on their GWAS regions and QTLs, and it provides them with a means to prioritize SNPs/genes for further experiments. The source code, a Docker image, and a demo version of RicePilaf are publicly available at https://github.com/bioinfodlsu/rice-pilaf.
Affiliation(s)
- Anish M S Shrestha
- Bioinformatics Lab, Advanced Research Institute for Informatics, Computing and Networking, College of Computer Studies, De La Salle University, Manila 1004, Philippines
- International Rice Research Institute (IRRI), Metro Manila 1301, Philippines
| | - Mark Edward M Gonzales
- Bioinformatics Lab, Advanced Research Institute for Informatics, Computing and Networking, College of Computer Studies, De La Salle University, Manila 1004, Philippines
| | - Phoebe Clare L Ong
- Bioinformatics Lab, Advanced Research Institute for Informatics, Computing and Networking, College of Computer Studies, De La Salle University, Manila 1004, Philippines
| | - Pierre Larmande
- DIADE, Univ Montpellier, Cirad, IRD, 34394 Montpellier, France
| | - Hyun-Sook Lee
- National Institute of Crop Science, Wanju-gun 55365, Republic of Korea
| | - Ji-Ung Jeung
- National Institute of Crop Science, Wanju-gun 55365, Republic of Korea
| | - Ajay Kohli
- International Rice Research Institute (IRRI), Metro Manila 1301, Philippines
| | - Dmytro Chebotarov
- International Rice Research Institute (IRRI), Metro Manila 1301, Philippines
| | - Ramil P Mauleon
- International Rice Research Institute (IRRI), Metro Manila 1301, Philippines
| | - Jae-Sung Lee
- International Rice Research Institute (IRRI), Metro Manila 1301, Philippines
| | - Kenneth L McNally
- International Rice Research Institute (IRRI), Metro Manila 1301, Philippines
|
4
|
Dumschott K, Dörpholz H, Laporte MA, Brilhaus D, Schrader A, Usadel B, Neumann S, Arnaud E, Kranz A. Ontologies for increasing the FAIRness of plant research data. Front Plant Sci 2023; 14:1279694. [PMID: 38098789 PMCID: PMC10720748 DOI: 10.3389/fpls.2023.1279694] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Accepted: 11/15/2023] [Indexed: 12/17/2023]
Abstract
The importance of improving the FAIRness (findability, accessibility, interoperability, reusability) of research data is undeniable, especially in the face of large, complex datasets currently being produced by omics technologies. Facilitating the integration of a dataset with other types of data increases the likelihood of reuse, and the potential of answering novel research questions. Ontologies are a useful tool for semantically tagging datasets as adding relevant metadata increases the understanding of how data was produced and increases its interoperability. Ontologies provide concepts for a particular domain as well as the relationships between concepts. By tagging data with ontology terms, data becomes both human- and machine- interpretable, allowing for increased reuse and interoperability. However, the task of identifying ontologies relevant to a particular research domain or technology is challenging, especially within the diverse realm of fundamental plant research. In this review, we outline the ontologies most relevant to the fundamental plant sciences and how they can be used to annotate data related to plant-specific experiments within metadata frameworks, such as Investigation-Study-Assay (ISA). We also outline repositories and platforms most useful for identifying applicable ontologies or finding ontology terms.
Affiliation(s)
- Kathryn Dumschott
- Institute of Bio- and Geosciences (IBG-4: Bioinformatics) & Bioeconomy Science Center (BioSC), CEPLAS, Forschungszentrum Jülich, Jülich, Germany
| | - Hannah Dörpholz
- Institute of Bio- and Geosciences (IBG-4: Bioinformatics) & Bioeconomy Science Center (BioSC), CEPLAS, Forschungszentrum Jülich, Jülich, Germany
| | - Marie-Angélique Laporte
- Digital Solutions Team, Digital Inclusion Lever, Bioversity International, Montpellier Office, Montpellier, France
| | - Dominik Brilhaus
- Data Science and Management & Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Andrea Schrader
- Data Science and Management & Cluster of Excellence on Plant Sciences (CEPLAS), University of Cologne, Cologne, Germany
| | - Björn Usadel
- Institute of Bio- and Geosciences (IBG-4: Bioinformatics) & Bioeconomy Science Center (BioSC), CEPLAS, Forschungszentrum Jülich, Jülich, Germany
- Institute for Biological Data Science & Cluster of Excellence on Plant Sciences (CEPLAS), Faculty of Mathematics and Life Sciences, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Steffen Neumann
- Program Center MetaCom, Leibniz Institute of Plant Biochemistry, Halle, Germany
- German Centre for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, Germany
| | - Elizabeth Arnaud
- Digital Solutions Team, Digital Inclusion Lever, Bioversity International, Montpellier Office, Montpellier, France
| | - Angela Kranz
- Institute of Bio- and Geosciences (IBG-4: Bioinformatics) & Bioeconomy Science Center (BioSC), CEPLAS, Forschungszentrum Jülich, Jülich, Germany
|
5
|
Imbert B, Kreplak J, Flores RG, Aubert G, Burstin J, Tayeh N. Development of a knowledge graph framework to ease and empower translational approaches in plant research: a use-case on grain legumes. Front Artif Intell 2023; 6:1191122. [PMID: 37601035 PMCID: PMC10435283 DOI: 10.3389/frai.2023.1191122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Accepted: 07/10/2023] [Indexed: 08/22/2023] Open
Abstract
While the continuing decline in genotyping and sequencing costs has largely benefited plant research, some key species for meeting the challenges of agriculture remain mostly understudied. As a result, heterogeneous datasets for different traits are available for a significant number of these species. As gene structures and functions are to some extent conserved through evolution, comparative genomics can be used to transfer available knowledge from one species to another. However, such a translational research approach is complex due to the multiplicity of data sources and the non-harmonized description of the data. Here, we provide two pipelines, referred to as structural and functional pipelines, to create a framework for a NoSQL graph-database (Neo4j) to integrate and query heterogeneous data from multiple species. We call this framework Orthology-driven knowledge base framework for translational research (Ortho_KB). The structural pipeline builds bridges across species based on orthology. The functional pipeline integrates biological information, including QTL and RNA-sequencing datasets, and uses the backbone from the structural pipeline to connect orthologs in the database. Queries can be written using the Neo4j Cypher language and can, for instance, lead to the identification of genes controlling a common trait across species. To explore the possibilities offered by such a framework, we populated Ortho_KB to obtain OrthoLegKB, an instance dedicated to legumes. The proposed model was evaluated by studying the conservation of a flowering-promoting gene. Through a series of queries, we have demonstrated that our knowledge graph base provides an intuitive and powerful platform to support research and development programmes.
Affiliation(s)
- Baptiste Imbert
- Agroécologie, INRAE, Institut Agro, Univ. Bourgogne, Univ. Bourgogne Franche-Comté, Dijon, France
| | - Jonathan Kreplak
- Agroécologie, INRAE, Institut Agro, Univ. Bourgogne, Univ. Bourgogne Franche-Comté, Dijon, France
| | - Raphaël-Gauthier Flores
- Université Paris-Saclay, INRAE, URGI, Versailles, France
- Université Paris-Saclay, INRAE, BioinfOmics, Plant Bioinformatics Facility, Versailles, France
| | - Grégoire Aubert
- Agroécologie, INRAE, Institut Agro, Univ. Bourgogne, Univ. Bourgogne Franche-Comté, Dijon, France
| | - Judith Burstin
- Agroécologie, INRAE, Institut Agro, Univ. Bourgogne, Univ. Bourgogne Franche-Comté, Dijon, France
| | - Nadim Tayeh
- Agroécologie, INRAE, Institut Agro, Univ. Bourgogne, Univ. Bourgogne Franche-Comté, Dijon, France
|
6
|
Grimplet J. Genomic and Bioinformatic Resources for Perennial Fruit Species. Curr Genomics 2022; 23:217-233. [PMID: 36777875 PMCID: PMC9875543 DOI: 10.2174/1389202923666220428102632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2021] [Revised: 03/12/2022] [Accepted: 03/12/2022] [Indexed: 11/22/2022] Open
Abstract
In the post-genomic era, data management and development of bioinformatic tools are critical for the adequate exploitation of genomics data. In this review, we address the current situation for the subset of crops represented by the perennial fruit species. The agronomical singularity of these species compared to plant and crop model species poses significant challenges for the implementation of good practices, challenges that are generally not addressed in other species. Studies are usually performed over several years in non-controlled environments, usage of rootstock is common, and breeders heavily rely on vegetative propagation. A reference genome is now available for all the major species as well as many members of the economically important genera for breeding purposes. Development of pangenomes for these species is beginning to gain momentum, which will require a substantial effort in terms of bioinformatic tool development. The available tools for genome annotation and functional analysis will also be presented.
Affiliation(s)
- Jérôme Grimplet
- Centro de Investigación y Tecnología Agroalimentaria de Aragón (CITA), Unidad de Hortofruticultura, Gobierno de Aragón, Avda. Montañana, Zaragoza, Spain
- Instituto Agroalimentario de Aragón–IA2 (CITA-Universidad de Zaragoza), Calle Miguel Servet, Zaragoza, Spain
|
7
|
Larmande P, Tagny Ngompe G, Venkatesan A, Ruiz M. AgroLD: A Knowledge Graph Database for Plant Functional Genomics. Methods Mol Biol 2022; 2443:527-540. [PMID: 35037225 DOI: 10.1007/978-1-0716-2067-0_28] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/14/2023]
Abstract
Recent advances in high-throughput technologies have resulted in a tremendous increase in the amount of data in the agronomic domain. There is an urgent need to effectively integrate complementary information to understand the biological system in its entirety. We have developed AgroLD, a knowledge graph that exploits Semantic Web technology and some of the relevant standard domain ontologies to integrate information on plant species, thereby facilitating the formulation of new scientific hypotheses. This chapter outlines some integration results of the project, which initially focused on genomics, proteomics and phenomics.
Affiliation(s)
- Pierre Larmande
- DIADE, IRD, CIRAD, Univ. Montpellier, Montpellier, France.
- French Institute of Bioinformatics (IFB)-South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, Montpellier, France.
| | - Gildas Tagny Ngompe
- French Institute of Bioinformatics (IFB)-South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, Montpellier, France
- AGAP, CIRAD, INRAE, Univ. Montpellier, av Agropolis, Montpellier, France
| | | | - Manuel Ruiz
- French Institute of Bioinformatics (IFB)-South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, Montpellier, France
- AGAP, CIRAD, INRAE, Univ. Montpellier, av Agropolis, Montpellier, France
|
8
|
Gardiner LJ, Krishna R. Bluster or Lustre: Can AI Improve Crops and Plant Health? Plants (Basel) 2021; 10:plants10122707. [PMID: 34961177 PMCID: PMC8707749 DOI: 10.3390/plants10122707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2021] [Revised: 11/24/2021] [Accepted: 12/06/2021] [Indexed: 06/14/2023]
Abstract
In a changing climate where future food security is a growing concern, researchers are exploring new methods and technologies in the effort to meet ambitious crop yield targets. The application of Artificial Intelligence (AI), including Machine Learning (ML) methods, in this area has been proposed as a potential mechanism to support this. This review explores current research in the area to convey the state of the art as to how AI/ML have been used to advance research, gain insights, and generally enable progress in this area. We address the question: can AI improve crops and plant health? We further discriminate the bluster from the lustre by identifying the key challenges that AI has been shown to address, balanced with the potential issues with its usage, and the key requisites for its success. Overall, we hope to raise awareness and, as a result, promote usage of AI-related approaches where they can have appropriate impact to improve practices in agricultural and plant sciences.
|
9
|
Valentin G, Abdel T, Gaëtan D, Jean-François D, Matthieu C, Mathieu R. GreenPhylDB v5: a comparative pangenomic database for plant genomes. Nucleic Acids Res 2021; 49:D1464-D1471. [PMID: 33237299 PMCID: PMC7779052 DOI: 10.1093/nar/gkaa1068] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/04/2020] [Revised: 10/19/2020] [Accepted: 10/21/2020] [Indexed: 12/28/2022] Open
Abstract
Comparative genomics is the analysis of genomic relationships among different species and serves as a significant base for evolutionary and functional genomic studies. GreenPhylDB (https://www.greenphyl.org) is a database designed to facilitate the exploration of gene families and homologous relationships among plant genomes, including staple crops critically important for global food security. GreenPhylDB has been available since 2007, after the release of the Arabidopsis thaliana and Oryza sativa genomes, and has undergone multiple releases. With the number of plant genomes currently available, it becomes challenging to select a single reference for comparative genomics studies, but there is still a lack of databases taking advantage of several genomes per species for orthology detection. GreenPhylDBv5 introduces the concept of comparative pangenomics by harnessing multiple genome sequences per species. We created 19 pangenes and processed them with other species still relying on one genome. In total, 46 plant species were considered to build gene families and predict their homologous relationships through phylogenetic-based analyses. In addition, since the previous publication, we rejuvenated the website and included a new set of original tools, including protein-domain combination, tree topology searches and a section for users to store their own results in order to support community curation efforts.
Affiliation(s)
- Guignon Valentin
- Bioversity International, Parc Scientifique Agropolis II, 34397 Montpellier, France
- French Institute of Bioinformatics (IFB)—South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, F-34398 Montpellier France
| | - Toure Abdel
- Syngenta Seeds SAS, 31790 Saint-Sauveur France
| | - Droc Gaëtan
- French Institute of Bioinformatics (IFB)—South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, F-34398 Montpellier France
- AGAP, Univ de Montpellier, CIRAD, INRAE, Montpellier SupAgro, F-34398 Montpellier, France
- CIRAD, UMR AGAP, F-34398 Montpellier, France
| | - Dufayard Jean-François
- French Institute of Bioinformatics (IFB)—South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, F-34398 Montpellier France
- AGAP, Univ de Montpellier, CIRAD, INRAE, Montpellier SupAgro, F-34398 Montpellier, France
- CIRAD, UMR AGAP, F-34398 Montpellier, France
| | | | - Rouard Mathieu
- Bioversity International, Parc Scientifique Agropolis II, 34397 Montpellier, France
- French Institute of Bioinformatics (IFB)—South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, F-34398 Montpellier France
|
10
|
Thessen AE, Walls RL, Vogt L, Singer J, Warren R, Buttigieg PL, Balhoff JP, Mungall CJ, McGuinness DL, Stucky BJ, Yoder MJ, Haendel MA. Transforming the study of organisms: Phenomic data models and knowledge bases. PLoS Comput Biol 2020; 16:e1008376. [PMID: 33232313 PMCID: PMC7685442 DOI: 10.1371/journal.pcbi.1008376] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 01/07/2023] Open
Abstract
The rapidly decreasing cost of gene sequencing has resulted in a deluge of genomic data from across the tree of life; however, outside a few model organism databases, genomic data are limited in their scientific impact because they are not accompanied by computable phenomic data. The majority of phenomic data are contained in countless small, heterogeneous phenotypic data sets that are very difficult or impossible to integrate at scale because of variable formats, lack of digitization, and linguistic problems. One powerful solution is to represent phenotypic data using data models with precise, computable semantics, but adoption of semantic standards for representing phenotypic data has been slow, especially in biodiversity and ecology. Some phenotypic and trait data are available in a semantic language from knowledge bases, but these are often not interoperable. In this review, we will compare and contrast existing ontology and data models, focusing on nonhuman phenotypes and traits. We discuss barriers to integration of phenotypic data and make recommendations for developing an operationally useful, semantically interoperable phenotypic data ecosystem.
Affiliation(s)
- Anne E. Thessen
- Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, United States of America
- Ronin Institute for Independent Scholarship, Montclair, New Jersey, United States of America
| | - Ramona L. Walls
- Bio5 Institute, University of Arizona, Tucson, Arizona, United States of America
| | - Lars Vogt
- TIB Leibniz Information Centre for Science and Technology, Hannover, Germany
| | | | | | - Pier Luigi Buttigieg
- Alfred-Wegener-Institut, Helmholtz-Zentrum für Polar- und Meeresforschung, Bremerhaven, Germany
| | - James P. Balhoff
- Renaissance Computing Institute, University of North Carolina, Chapel Hill, North Carolina, United States of America
| | - Christopher J. Mungall
- Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
| | | | - Brian J. Stucky
- Florida Museum of Natural History, University of Florida, Gainesville, Florida, United States of America
| | - Matthew J. Yoder
- Illinois Natural History Survey, Champaign, Illinois, United States of America
| | - Melissa A. Haendel
- Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, United States of America
|
11
|
Larmande P, Do H, Wang Y. OryzaGP: rice gene and protein dataset for named-entity recognition. Genomics Inform 2019; 17:e17. [PMID: 31307132 PMCID: PMC6808627 DOI: 10.5808/gi.2019.17.2.e17] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 12/14/2018] [Accepted: 05/30/2019] [Indexed: 11/20/2022] Open
Abstract
Text mining has become an important research method in biology, with its original purpose being to extract biological entities, such as genes, proteins and phenotypic traits, to extend knowledge from scientific papers. However, few thorough studies on text mining and application development for plant molecular biology data have been performed, especially for rice, resulting in a lack of datasets available to solve named-entity recognition tasks for this species. Since few benchmarks are available for rice, we faced various difficulties in exploiting advanced machine learning methods for accurate analysis of the rice literature. To evaluate several approaches to automatically extract information from gene/protein entities, we built a new dataset for rice as a benchmark. This dataset is composed of a set of titles and abstracts, extracted from scientific papers focusing on the rice species and downloaded from PubMed. During the 5th Biomedical Linked Annotation Hackathon, a portion of the dataset was uploaded to PubAnnotation for sharing. Our ultimate goal is to offer a shared task of rice gene/protein name recognition through the BioNLP Open Shared Tasks framework using the dataset, to facilitate an open comparison and evaluation of different approaches to the task.
Affiliation(s)
- Pierre Larmande
- UMR DIADE, Institute of Research for Sustainable Development (IRD), F-34394 Montpellier, France
- ICT Lab, University of Science and Technology of Hanoi (USTH), 100000 Hanoi, Vietnam
| | - Huy Do
- ICT Lab, University of Science and Technology of Hanoi (USTH), 100000 Hanoi, Vietnam
| | - Yue Wang
- Database Center for Life Science (DBCLS), Chiba 277-0871, Japan
|