Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Bouadjenek MR, Verspoor K, Zobel J. Literature consistency of bioinformatics sequence databases is effective for assessing record quality. Database (Oxford) 2017;2017:3074790. [PMID: 28365737 PMCID: PMC5467556 DOI: 10.1093/database/bax021] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 11/01/2016] [Accepted: 02/20/2017] [Indexed: 11/18/2022]



Cited by Other Article(s)

Goudey B, Geard N, Verspoor K, Zobel J. Propagation, detection and correction of errors using the sequence database network. Brief Bioinform 2022;23:6764545. [PMID: 36266246 PMCID: PMC9677457 DOI: 10.1093/bib/bbac416] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/10/2022] [Revised: 07/31/2022] [Accepted: 08/28/2022] [Indexed: 12/14/2022] Open

Abstract

Nucleotide and protein sequences stored in public databases are the cornerstone of many bioinformatics analyses. The records containing these sequences are prone to a wide range of errors, including incorrect functional annotation, sequence contamination and taxonomic misclassification. One source of information that can help to detect errors is the strong interdependency between records. Novel sequences in one database draw their annotations from existing records, may generate new records in multiple other locations and will have varying degrees of similarity with existing records across a range of attributes. A network perspective of these relationships between sequence records, within and across databases, offers new opportunities to detect, or even correct, erroneous entries and more broadly to make inferences about record quality. Here, we describe this novel perspective of sequence database records as a rich network, which we call the sequence database network, and illustrate the opportunities this perspective offers for quantification of database quality and detection of spurious entries. We provide an overview of the relevant databases and describe how the interdependencies between sequence records across these databases can be exploited by network analyses. We review the process of sequence annotation and provide a classification of sources of error, highlighting propagation as a major source. We illustrate the value of a network perspective through three case studies that use network analysis to detect errors, and explore the quality and quantity of critical relationships that would inform such network analyses. This systematic description of a network perspective of sequence database records provides a novel direction to combat the proliferation of errors within these critical bioinformatics resources.


Šturm MB, Smith S, Ganbaatar O, Buuveibaatar B, Balint B, Payne JC, Voigt CC, Kaczensky P. Isotope analysis combined with DNA barcoding provide new insights into the dietary niche of khulan in the Mongolian Gobi. PLoS One 2021;16:e0248294. [PMID: 33780458 PMCID: PMC8006982 DOI: 10.1371/journal.pone.0248294] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/17/2020] [Accepted: 02/23/2021] [Indexed: 11/26/2022] Open

Abstract

With increasing livestock numbers, competition and avoidance are increasingly shaping resource availability for wild ungulates. Shifts in the dietary niche of wild ungulates are likely and can be expected to negatively affect their fitness. The Mongolian Gobi constitutes the largest remaining refuge for several threatened ungulates, but unprecedentedly high livestock numbers are sparking growing concerns over rangeland health and impacts on threatened ungulates like the Asiatic wild ass (khulan). Previous stable isotope analysis of khulan tail hair from the Dzungarian Gobi suggested that they graze in summer but switch to a poorer mixed C3 grass / C4 shrub diet in winter, most likely in reaction to local herders and their livestock. Here we attempt to validate these findings with a different methodology, DNA metabarcoding. Further, we extend the scope of the original study to the South Gobi Region, where we expect higher proportions of low-quality browse in the khulan winter diet due to a higher human and livestock presence. Barcoding confirmed the assumptions behind the seasonal diet change observed in the Dzungarian Gobi isotope data, and new isotope analysis revealed a strong seasonal pattern and higher C4 plant intake in the South Gobi Region, in line with our expectations. However, DNA barcoding revealed C4 domination of winter diet was due to C4 grasses (rather than shrubs) for the South Gobi Region. Slight climatic differences result in regional shifts in the occurrence of C3 and C4 grasses and shrubs, which do not allow for an isotopic separation along the grazer-browser continuum over the entire Gobi. Our findings do not allow us to confirm human impacts upon dietary preferences in khulan as we lack seasonal samples from the South Gobi Region. However, these data provide novel insight into khulan diet, raise new questions about plant availability versus preference, and provide a cautionary tale about indirect analysis methods if used in isolation or extrapolated to the landscape level. Good concordance between relative read abundance of C4 genera from barcoding and proportion of C4 plants from isotope analysis adds to a growing body of evidence that barcoding is a promising quantitative tool to understand resource partitioning in ungulates.


Meyer C, Scalzitti N, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes. BMC Bioinformatics 2020;21:513. [PMID: 33172385 PMCID: PMC7656754 DOI: 10.1186/s12859-020-03855-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/28/2020] [Accepted: 10/30/2020] [Indexed: 11/10/2022] Open

Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases. Genomics Proteomics Bioinformatics 2020;18:91-103. [PMID: 32652120 PMCID: PMC7646089 DOI: 10.1016/j.gpb.2018.11.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 12/08/2017] [Revised: 10/24/2018] [Accepted: 12/14/2018] [Indexed: 11/27/2022]

Bouadjenek MR, Zobel J, Verspoor K. Automated assessment of biological database assertions using the scientific literature. BMC Bioinformatics 2019;20:216. [PMID: 31035936 PMCID: PMC6489365 DOI: 10.1186/s12859-019-2801-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/14/2018] [Accepted: 04/09/2019] [Indexed: 12/27/2022] Open

Bouadjenek MR, Verspoor K. Multi-field query expansion is effective for biomedical dataset retrieval. Database (Oxford) 2017;2017:4107606. [PMID: 29220457 PMCID: PMC5737205 DOI: 10.1093/database/bax062] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 03/20/2017] [Accepted: 07/31/2017] [Indexed: 01/01/2023]

Abstract

In the context of the bioCADDIE challenge addressing information retrieval of biomedical datasets, we propose a method for retrieval of biomedical data sets with heterogeneous schemas through query reformulation. In particular, the method proposed transforms the initial query into a multi-field query that is then enriched with terms that are likely to occur in the relevant datasets. We compare and evaluate two query expansion strategies, one based on the Rocchio method and another based on a biomedical lexicon. We then perform a comprehensive comparative evaluation of our method on the bioCADDIE dataset collection for biomedical retrieval. We demonstrate the effectiveness of our multi-field query method compared to two baselines, with MAP improved from 0.2171 and 0.2669 to 0.2996. We also show the benefits of query expansion, where the Rocchio expansion method improves the MAP for our two baselines from 0.2171 and 0.2669 to 0.335. We show that the Rocchio query expansion method slightly outperforms the one based on the biomedical lexicon as a source of terms, with an improvement of roughly 3% for MAP. However, the query expansion method based on the biomedical lexicon is much less resource intensive since it does not require computation of any relevance feedback set or any initial execution of the query. Hence, in terms of trade-off between efficiency, execution time and retrieval accuracy, we argue that the query expansion method based on the biomedical lexicon offers the best performance for a prototype biomedical data search engine intended to be used at a large scale. In the official bioCADDIE challenge results, although our approach is ranked seventh in terms of the infNDCG evaluation metric, it ranks second in terms of P@10 and NDCG. Hence, the method proposed here provides overall good retrieval performance in relation to the approaches of other competitors. Consequently, the observations made in this paper should benefit the development of a Data Discovery Index prototype or the improvement of the existing one.
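
For readers unfamiliar with the Rocchio method named in the abstract above, the following is a minimal sketch of Rocchio-style query expansion with pseudo-relevance feedback. It is not the authors' implementation: the function name rocchio_expand, the weights alpha and beta, the use of raw term counts, and the toy documents are all assumptions made for illustration, and the negative-feedback term of the full Rocchio formula is omitted.

```python
from collections import Counter


def rocchio_expand(query_terms, feedback_docs, alpha=1.0, beta=0.75, n_new_terms=2):
    """Expand a query with the highest-weighted terms from feedback documents.

    query_terms: list of tokens in the original query.
    feedback_docs: list of token lists, e.g. the top-ranked results of an
        initial run of the query (assumed relevant; hypothetical data here).
    Returns (expanded_terms, weights).
    """
    query_vec = Counter(query_terms)

    # Centroid of the feedback documents: mean raw term count per document.
    centroid = Counter()
    for doc in feedback_docs:
        for term, count in Counter(doc).items():
            centroid[term] += count / len(feedback_docs)

    # Positive-feedback Rocchio update: alpha * query + beta * centroid.
    weights = {
        term: alpha * query_vec.get(term, 0) + beta * centroid.get(term, 0.0)
        for term in set(query_vec) | set(centroid)
    }

    # Keep the original terms and append the best-scoring new ones
    # (ties broken alphabetically so the result is deterministic).
    candidates = [t for t in weights if t not in query_vec]
    candidates.sort(key=lambda t: (-weights[t], t))
    return list(query_terms) + candidates[:n_new_terms], weights


if __name__ == "__main__":
    docs = [
        ["protein", "kinase", "phosphorylation", "signaling"],
        ["kinase", "phosphorylation", "inhibitor"],
    ]
    expanded, _ = rocchio_expand(["protein", "kinase"], docs)
    print(expanded)
```

The sketch also makes the abstract's efficiency argument concrete: feedback_docs must come from an initial execution of the query, which is the extra cost a lexicon-based expansion avoids.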
