1
Goudey B, Geard N, Verspoor K, Zobel J. Propagation, detection and correction of errors using the sequence database network. Brief Bioinform 2022; 23:6764545. [PMID: 36266246 PMCID: PMC9677457 DOI: 10.1093/bib/bbac416] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/10/2022] [Revised: 07/31/2022] [Accepted: 08/28/2022] [Indexed: 12/14/2022]
Abstract
Nucleotide and protein sequences stored in public databases are the cornerstone of many bioinformatics analyses. The records containing these sequences are prone to a wide range of errors, including incorrect functional annotation, sequence contamination and taxonomic misclassification. One source of information that can help to detect errors is the strong interdependency between records. Novel sequences in one database draw their annotations from existing records, may generate new records in multiple other locations and will have varying degrees of similarity with existing records across a range of attributes. A network perspective of these relationships between sequence records, within and across databases, offers new opportunities to detect, or even correct, erroneous entries and more broadly to make inferences about record quality. Here, we describe this novel perspective of sequence database records as a rich network, which we call the sequence database network, and illustrate the opportunities this perspective offers for quantification of database quality and detection of spurious entries. We provide an overview of the relevant databases and describe how the interdependencies between sequence records across these databases can be exploited by network analyses. We review the process of sequence annotation and provide a classification of sources of error, highlighting propagation as a major source. We illustrate the value of a network perspective through three case studies that use network analysis to detect errors, and explore the quality and quantity of critical relationships that would inform such network analyses. This systematic description of a network perspective of sequence database records provides a novel direction to combat the proliferation of errors within these critical bioinformatics resources.
Affiliation(s)
- Benjamin Goudey
  - Corresponding author. Benjamin Goudey, School of Computing and Information Systems, University of Melbourne, Parkville, Victoria, 3010
- Nicholas Geard
  - School of Computing and Information Systems, University of Melbourne, Parkville, Victoria, 3010
- Karin Verspoor
  - School of Computing Technologies, RMIT University, Melbourne, Victoria, 3000
- Justin Zobel
  - School of Computing and Information Systems, University of Melbourne, Parkville, Victoria, 3010
2
Šturm MB, Smith S, Ganbaatar O, Buuveibaatar B, Balint B, Payne JC, Voigt CC, Kaczensky P. Isotope analysis combined with DNA barcoding provide new insights into the dietary niche of khulan in the Mongolian Gobi. PLoS One 2021; 16:e0248294. [PMID: 33780458 PMCID: PMC8006982 DOI: 10.1371/journal.pone.0248294] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/17/2020] [Accepted: 02/23/2021] [Indexed: 11/26/2022]
Abstract
With increasing livestock numbers, competition and avoidance are increasingly shaping resource availability for wild ungulates. Shifts in the dietary niche of wild ungulates are likely and can be expected to negatively affect their fitness. The Mongolian Gobi constitutes the largest remaining refuge for several threatened ungulates, but unprecedentedly high livestock numbers are sparking growing concerns over rangeland health and impacts on threatened ungulates like the Asiatic wild ass (khulan). Previous stable isotope analysis of khulan tail hair from the Dzungarian Gobi suggested that they graze in summer but switch to a poorer mixed C3 grass / C4 shrub diet in winter, most likely in reaction to local herders and their livestock. Here we attempt to validate these findings with a different methodology, DNA metabarcoding. Further, we extend the scope of the original study to the South Gobi Region, where we expect higher proportions of low-quality browse in the khulan winter diet due to a higher human and livestock presence. Barcoding confirmed the assumptions behind the seasonal diet change observed in the Dzungarian Gobi isotope data, and new isotope analysis revealed a strong seasonal pattern and higher C4 plant intake in the South Gobi Region, in line with our expectations. However, DNA barcoding revealed C4 domination of winter diet was due to C4 grasses (rather than shrubs) for the South Gobi Region. Slight climatic differences result in regional shifts in the occurrence of C3 and C4 grasses and shrubs, which do not allow for an isotopic separation along the grazer-browser continuum over the entire Gobi. Our findings do not allow us to confirm human impacts upon dietary preferences in khulan as we lack seasonal samples from the South Gobi Region. However, these data provide novel insight into khulan diet, raise new questions about plant availability versus preference, and provide a cautionary tale about indirect analysis methods if used in isolation or extrapolated to the landscape level. Good concordance between relative read abundance of C4 genera from barcoding and proportion of C4 plants from isotope analysis adds to a growing body of evidence that barcoding is a promising quantitative tool to understand resource partitioning in ungulates.
Affiliation(s)
- Martina Burnik Šturm
  - Research Institute of Wildlife Ecology, University of Veterinary Medicine, Vienna, Austria
- Steve Smith
  - Konrad-Lorenz Institute of Ethology, University of Veterinary Medicine, Vienna, Austria
- Oyunsaikhan Ganbaatar
  - Great Gobi B Strictly Protected Area Administration, Takhin Tal, Gobi Altai Province, Mongolia
  - Department of Zoology, School of Biology and Biotechnology, National University of Mongolia, Ulaanbaatar, Mongolia
- Boglarka Balint
  - Great Gobi B Strictly Protected Area Administration, Takhin Tal, Gobi Altai Province, Mongolia
- John C. Payne
  - Research Institute of Wildlife Ecology, University of Veterinary Medicine, Vienna, Austria
- Petra Kaczensky
  - Research Institute of Wildlife Ecology, University of Veterinary Medicine, Vienna, Austria
  - Norwegian Institute for Nature Research–NINA, Trondheim, Norway
3
Meyer C, Scalzitti N, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes. BMC Bioinformatics 2020; 21:513. [PMID: 33172385 PMCID: PMC7656754 DOI: 10.1186/s12859-020-03855-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/28/2020] [Accepted: 10/30/2020] [Indexed: 11/10/2022]
Abstract
Background: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses. Results: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins. Conclusions: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon–intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.
Affiliation(s)
- Corentin Meyer
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Nicolas Scalzitti
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Anne Jeannin-Girardon
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Pierre Collet
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Olivier Poch
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
- Julie D Thompson
  - Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
4
Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases. GENOMICS PROTEOMICS & BIOINFORMATICS 2020; 18:91-103. [PMID: 32652120 PMCID: PMC7646089 DOI: 10.1016/j.gpb.2018.11.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 12/08/2017] [Revised: 10/24/2018] [Accepted: 12/14/2018] [Indexed: 11/27/2022]
5
Bouadjenek MR, Zobel J, Verspoor K. Automated assessment of biological database assertions using the scientific literature. BMC Bioinformatics 2019; 20:216. [PMID: 31035936 PMCID: PMC6489365 DOI: 10.1186/s12859-019-2801-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/14/2018] [Accepted: 04/09/2019] [Indexed: 12/27/2022]
Abstract
BACKGROUND: Large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. In this paper we explore an automated method for assessing the consistency of biological assertions, to assist biocurators, which we call BARC, Biocuration tool for Assessment of Relation Consistency. In this method a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we then use our novel set-based relevance algorithm SaBRA to retrieve pertinent literature, and apply a classifier to estimate the likelihood that this relation (assertion) is correct. RESULTS: Our experiments on assessing gene-disease relations and protein-protein interactions using the PubMed Central collection show that BARC can be effective at assisting curators to perform data cleansing. Specifically, the results show that BARC substantially outperforms the best baselines, improving F-measure by 3.5% and 13% on gene-disease relations and protein-protein interactions, respectively. We have additionally carried out a feature analysis that showed that all feature types are informative, as are all fields of the documents. CONCLUSIONS: BARC provides a clear benefit for the biocuration community, as there are no prior automated tools for identifying inconsistent assertions in large-scale biological databases.
Affiliation(s)
- Mohamed Reda Bouadjenek
  - Department of Mechanical & Industrial Engineering, University of Toronto, Toronto, M5S 3G8 Canada
- Justin Zobel
  - School of Computing and Information Systems, University of Melbourne, Melbourne, 3010 Australia
- Karin Verspoor
  - School of Computing and Information Systems, University of Melbourne, Melbourne, 3010 Australia
6
Bouadjenek MR, Verspoor K. Multi-field query expansion is effective for biomedical dataset retrieval. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017; 2017:4107606. [PMID: 29220457 PMCID: PMC5737205 DOI: 10.1093/database/bax062] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 03/20/2017] [Accepted: 07/31/2017] [Indexed: 01/01/2023]
Abstract
In the context of the bioCADDIE challenge addressing information retrieval of biomedical datasets, we propose a method for retrieval of biomedical datasets with heterogeneous schemas through query reformulation. In particular, the method proposed transforms the initial query into a multi-field query that is then enriched with terms that are likely to occur in the relevant datasets. We compare and evaluate two query expansion strategies, one based on the Rocchio method and another based on a biomedical lexicon. We then perform a comprehensive comparative evaluation of our method on the bioCADDIE dataset collection for biomedical retrieval. We demonstrate the effectiveness of our multi-field query method compared to two baselines, with MAP improved from 0.2171 and 0.2669 to 0.2996. We also show the benefits of query expansion, where the Rocchio expansion method improves the MAP for our two baselines from 0.2171 and 0.2669 to 0.335. We show that the Rocchio query expansion method slightly outperforms the one based on the biomedical lexicon as a source of terms, with an improvement of roughly 3% for MAP. However, the query expansion method based on the biomedical lexicon is much less resource intensive since it does not require computation of any relevance feedback set or any initial execution of the query. Hence, in terms of the trade-off between efficiency, execution time and retrieval accuracy, we argue that the query expansion method based on the biomedical lexicon offers the best performance for a prototype biomedical data search engine intended to be used at a large scale. In the official bioCADDIE challenge results, although our approach is ranked seventh in terms of the infNDCG evaluation metric, it ranks second in terms of P@10 and NDCG. Hence, the method proposed here provides overall good retrieval performance in relation to the approaches of other competitors. Consequently, the observations made in this paper should benefit the development of a Data Discovery Index prototype or the improvement of the existing one.
Affiliation(s)
- Mohamed Reda Bouadjenek
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, 3010, Australia
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, 3010, Australia