1
Chen H, Li R, Cleveland A, Ding J. Enhancing data quality in medical concept normalization through large language models. J Biomed Inform 2025; 165:104812. [PMID: 40180205 DOI: 10.1016/j.jbi.2025.104812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2024] [Revised: 01/26/2025] [Accepted: 03/13/2025] [Indexed: 04/05/2025]
Abstract
OBJECTIVE Medical concept normalization (MCN) aims to map informal medical terms to formal medical concepts, a critical task in building machine learning systems for medical applications. However, most existing studies on MCN primarily focus on models and algorithms, often overlooking the vital role of data quality. This research evaluates MCN performance across varying data quality scenarios and investigates how to leverage these evaluation results to enhance data quality, ultimately improving MCN performance through the use of large language models (LLMs). The effectiveness of the proposed approach is demonstrated through a case study. METHODS We begin by conducting a data quality evaluation of a dataset used for MCN. Based on these findings, we employ ChatGPT-based zero-shot prompting for data augmentation. The quality of the generated data is then assessed across the dimensions of correctness and comprehensiveness. A series of experiments is performed to analyze the impact of data quality on MCN model performance. These results guide us in implementing LLM-based few-shot prompting to further enhance data quality and improve model performance. RESULTS Duplication of data items within a dataset can lead to inaccurate evaluation results. Data augmentation techniques such as zero-shot and few-shot learning with ChatGPT can introduce duplicated data items, particularly those in the mean region of a dataset's distribution. As such, data augmentation strategies must be carefully designed, incorporating context information and training data to avoid these issues. Additionally, we found that including augmented data in the testing set is necessary to fairly evaluate the effectiveness of data augmentation strategies. CONCLUSION While LLMs can generate high-quality data for MCN, the success of data augmentation depends heavily on the strategy employed. Our study found that few-shot learning, with prompts that incorporate appropriate context and a small, representative set of original data, is an effective approach. The methods developed in this research, including the data quality evaluation framework, LLM-based data augmentation strategies, and procedures for data quality enhancement, provide valuable insights for data augmentation and evaluation in similar deep learning applications. AVAILABILITY https://github.com/RichardLRC/mcn-data-quality-llm/tree/main/evaluation.
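Illustrative note: the abstract reports that duplicated data items can distort evaluation results. The following is a minimal sketch of how such duplicates might be detected between a training and a testing split. It is not the authors' code; the function names, the normalization rule (lowercasing and whitespace collapsing), and the toy term/concept pairs are all assumptions made for illustration.

```python
def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(term.lower().split())


def find_leakage(train, test):
    """Return test items whose (normalized term, concept) pair also occurs in train."""
    seen = {(normalize(term), concept) for term, concept in train}
    return [(term, concept) for term, concept in test
            if (normalize(term), concept) in seen]


def deduplicate(items):
    """Keep the first occurrence of each (normalized term, concept) pair."""
    seen, unique = set(), []
    for term, concept in items:
        key = (normalize(term), concept)
        if key not in seen:
            seen.add(key)
            unique.append((term, concept))
    return unique


# Hypothetical toy data: informal term -> formal concept.
train = [("head ache", "Headache"), ("tummy pain", "Abdominal pain")]
test = [("Head  ache", "Headache"), ("cant sleep", "Insomnia")]

leaked = find_leakage(train, test)       # test items already present in train
merged = deduplicate(train + test)       # combined data with duplicates removed
```

Under these assumptions, a test item that also appears in the training split would be flagged before evaluation, so that reported scores are not inflated by items the model has already seen.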
Affiliation(s)
- Haihua Chen
  - The Anuradha & Vikas Sinha Department of Data Science, University of North Texas, Denton, 76203, TX, USA.
- Ruochi Li
  - Department of Computer Science, North Carolina State University, Raleigh, 27695, NC, USA.
- Ana Cleveland
  - Department of Information Science, University of North Texas, Denton, 76203, TX, USA.
- Junhua Ding
  - The Anuradha & Vikas Sinha Department of Data Science, University of North Texas, Denton, 76203, TX, USA.

2
Mora-Márquez F, Hurtado M, López de Heredia U. gymnotoa-db: a database and application to optimize functional annotation in gymnosperms. Database (Oxford) 2025; 2025:baaf019. [PMID: 40052362 PMCID: PMC11886576 DOI: 10.1093/database/baaf019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2024] [Revised: 02/11/2025] [Accepted: 02/18/2025] [Indexed: 03/09/2025]
Abstract
Gymnosperms are a clade of non-flowering plants that include about 1000 living species. Due to their complex genomes and lack of genomic resources, functional annotation in genomics and transcriptomics on gymnosperms suffers from limitations. Here we present gymnotoa-db, which is a novel, publicly accessible relational database designed to facilitate functional annotation in gymnosperms. This database stores non-redundant records of gymnosperm proteins, encompassing taxonomic and functional information. The complementary software, gymnotoa-app, enables users to download gymnotoa-db and execute a comprehensive functional annotation pipeline for high-throughput sequencing-derived DNA or cDNA sequences. gymnotoa-app's user-friendly interface and efficient algorithms streamline the functional annotation process, making it an invaluable tool for researchers studying gymnosperms. We compared gymnotoa-app's performance against other annotation tools utilizing disparate reference databases. Our results demonstrate gymnotoa-app's superior ability to accurately annotate gymnosperm transcripts, recovering a greater number of transcripts and unique, non-redundant Gene Ontology terms. gymnotoa-db's distinctive features include comprehensive coverage with a non-redundant dataset of gymnosperm protein sequences, robust functional information that integrates data from multiple ontology systems, including GO, KEGG, EC, and MetaCYC, while keeping the taxonomic context, including Arabidopsis homologs. Database URL: https://blogs.upm.es/gymnotoa-db/2024/09/19/gymnotoa-app/.
Affiliation(s)
- Fernando Mora-Márquez
  - GI en Desarrollo de Especies y Comunidades Leñosas (WooSP), Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politécnica de Madrid, José Antonio Novais 10, Madrid 28040, Spain
- Mikel Hurtado
  - GI en Desarrollo de Especies y Comunidades Leñosas (WooSP), Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politécnica de Madrid, José Antonio Novais 10, Madrid 28040, Spain
- Unai López de Heredia
  - GI en Desarrollo de Especies y Comunidades Leñosas (WooSP), Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politécnica de Madrid, José Antonio Novais 10, Madrid 28040, Spain

3
Kuang Z, Yan X, Yuan Y, Wang R, Zhu H, Wang Y, Li J, Ye J, Yue H, Yang X. Advances in stress-tolerance elements for microbial cell factories. Synth Syst Biotechnol 2024; 9:793-808. [PMID: 39072145 PMCID: PMC11277822 DOI: 10.1016/j.synbio.2024.06.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 06/10/2024] [Accepted: 06/27/2024] [Indexed: 07/30/2024] Open
Abstract
Microorganisms, particularly extremophiles, have evolved multiple adaptation mechanisms to address diverse stress conditions during survival in unique environments. Their responses to environmental coercion decide not only survival in severe conditions but are also an essential factor determining bioproduction performance. The design of robust cell factories should take the balance of their growing and bioproduction into account. Thus, mining and redesigning stress-tolerance elements to optimize the performance of cell factories under various extreme conditions is necessary. Here, we reviewed several stress-tolerance elements, including acid-tolerant elements, saline-alkali-resistant elements, thermotolerant elements, antioxidant elements, and so on, providing potential materials for the construction of cell factories and the development of synthetic biology. Strategies for mining and redesigning stress-tolerance elements were also discussed. Moreover, several applications of stress-tolerance elements were provided, and perspectives and discussions for potential strategies for screening stress-tolerance elements were made.
Affiliation(s)
- Zheyi Kuang
  - School of Intelligence Science and Technology, Xinjiang University, Urumqi, 830017, China
- Xiaofang Yan
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China
- Yanfei Yuan
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China
- Ruiqi Wang
  - School of Intelligence Science and Technology, Xinjiang University, Urumqi, 830017, China
- Haifan Zhu
  - School of Intelligence Science and Technology, Xinjiang University, Urumqi, 830017, China
- Youyang Wang
  - School of Intelligence Science and Technology, Xinjiang University, Urumqi, 830017, China
- Jianfeng Li
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China
- Jianwen Ye
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China
- Haitao Yue
  - School of Intelligence Science and Technology, Xinjiang University, Urumqi, 830017, China
  - Laboratory of Synthetic Biology, School of Life Science and Technology, Xinjiang University, Urumqi, 830017, China
- Xiaofeng Yang
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China

4
Chen J, Goudey B, Geard N, Verspoor K. Integration of background knowledge for automatic detection of inconsistencies in gene ontology annotation. Bioinformatics 2024; 40:i390-i400. [PMID: 38940182 PMCID: PMC11256942 DOI: 10.1093/bioinformatics/btae246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION Biological background knowledge plays an important role in the manual quality assurance (QA) of biological database records. One such QA task is the detection of inconsistencies in literature-based Gene Ontology Annotation (GOA). This manual verification ensures the accuracy of the GO annotations based on a comprehensive review of the literature used as evidence, Gene Ontology (GO) terms, and annotated genes in GOA records. While automatic approaches for the detection of semantic inconsistencies in GOA have been developed, they operate within predetermined contexts, lacking the ability to leverage broader evidence, especially relevant domain-specific background knowledge. This paper investigates various types of background knowledge that could improve the detection of prevalent inconsistencies in GOA. In addition, the paper proposes several approaches to integrate background knowledge into the automatic GOA inconsistency detection process. RESULTS We have extended a previously developed GOA inconsistency dataset with several kinds of GOA-related background knowledge, including GeneRIF statements, biological concepts mentioned within evidence texts, GO hierarchy and existing GO annotations of the specific gene. We have proposed several effective approaches to integrate background knowledge as part of the automatic GOA inconsistency detection process. The proposed approaches can improve automatic detection of self-consistency and several of the most prevalent types of inconsistencies. This is the first study to explore the advantages of utilizing background knowledge and to propose a practical approach to incorporate knowledge in automatic GOA inconsistency detection. We establish a new benchmark for performance on this task. Our methods may be applicable to various tasks that involve incorporating biological background knowledge. AVAILABILITY AND IMPLEMENTATION https://github.com/jiyuc/de-inconsistency.
Affiliation(s)
- Jiyu Chen
  - School of Computing and Information Systems, The University of Melbourne, Parkville 3010, VIC, Australia
  - Data61, The Commonwealth Scientific and Industrial Research Organisation, Marsfield 2122, NSW, Australia
- Benjamin Goudey
  - School of Computing and Information Systems, The University of Melbourne, Parkville 3010, VIC, Australia
- Nicholas Geard
  - School of Computing and Information Systems, The University of Melbourne, Parkville 3010, VIC, Australia
- Karin Verspoor
  - School of Computing Technologies, RMIT University, Melbourne, Victoria 3000, Australia

5
De Coninck T, Gippert GP, Henrissat B, Desmet T, Van Damme EJM. Investigating diversity and similarity between CBM13 modules and ricin-B lectin domains using sequence similarity networks. BMC Genomics 2024; 25:643. [PMID: 38937673 PMCID: PMC11212257 DOI: 10.1186/s12864-024-10554-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2024] [Accepted: 06/24/2024] [Indexed: 06/29/2024] Open
Abstract
BACKGROUND The CBM13 family comprises carbohydrate-binding modules that occur mainly in enzymes and in several ricin-B lectins. The ricin-B lectin domain resembles the CBM13 module to a large extent. Historically, ricin-B lectins and CBM13 proteins were considered completely distinct, despite their structural and functional similarities. RESULTS In this data mining study, we investigate structural and functional similarities of these intertwined protein groups. Because of the high structural and functional similarities, and differences in nomenclature usage in several databases, confusion can arise. First, we demonstrate how public protein databases use different nomenclature systems to describe CBM13 modules and putative ricin-B lectin domains. We suggest the introduction of a novel CBM13 domain identifier, as well as the extension of CAZy cross-references in UniProt to guard the distinction between CAZy and non-CAZy entries in public databases. Since similar problems may occur with other lectin families and CBM families, we suggest the introduction of novel CBM InterPro domain identifiers to all existing CBM families. Second, we investigated phylogenetic, nomenclatural and structural similarities between putative ricin-B lectin domains and CBM13 modules, making use of sequence similarity networks. We concluded that the ricin-B/CBM13 superfamily may be larger than initially thought and that several putative ricin-B lectin domains may display CAZyme functionalities, although biochemical proof remains to be delivered. CONCLUSIONS Ricin-B lectin domains and CBM13 modules are associated groups of proteins whose database semantics are currently biased towards ricin-B lectins. Revision of the CAZy cross-reference in UniProt and introduction of a dedicated CBM13 domain identifier in InterPro may resolve this issue. In addition, our analyses show that several proteins with putative ricin-B lectin domains show very strong structural similarity to CBM13 modules. Therefore ricin-B lectin domains and CBM13 modules could be considered distant members of a larger ricin-B/CBM13 superfamily.
Affiliation(s)
- Tibo De Coninck
  - Laboratory of Biochemistry and Glycobiology, Department of Biotechnology, Ghent University, Proeftuinstraat 86, Ghent, 9000, Belgium
  - Centre for Synthetic Biology, Department of Biotechnology, Ghent University, Coupure Links 653, Ghent, 9000, Belgium
- Garry P Gippert
  - Section for Protein Chemistry and Enzyme Technology, Department of Biotechnology & Biomedicine, Technical University of Denmark, Søltofts Plads 224, Kgs. Lyngby, 2800, Denmark
- Bernard Henrissat
  - Section for Protein Chemistry and Enzyme Technology, Department of Biotechnology & Biomedicine, Technical University of Denmark, Søltofts Plads 224, Kgs. Lyngby, 2800, Denmark
- Tom Desmet
  - Centre for Synthetic Biology, Department of Biotechnology, Ghent University, Coupure Links 653, Ghent, 9000, Belgium
- Els J M Van Damme
  - Laboratory of Biochemistry and Glycobiology, Department of Biotechnology, Ghent University, Proeftuinstraat 86, Ghent, 9000, Belgium.

6
Chong LC, Khan AM. A Systematic Bioinformatics Approach for Mapping the Minimal Set of a Viral Peptidome. Curr Protoc 2024; 4:e1056. [PMID: 38856995 DOI: 10.1002/cpz1.1056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Sequence changes in viral genomes generate protein sequence diversity that enables viruses to evade the host immune system, hindering the development of effective preventive and therapeutic interventions. The massive proliferation of sequence data provides unprecedented opportunities to study viral adaptation and evolution. An alignment-free approach removes various restrictions posed by an alignment-dependent approach for studying sequence diversity. The publicly available tool, UNIQmin, offers an alignment-free approach for studying viral sequence diversity at any given rank of taxonomy lineage and is big data ready. The tool performs an exhaustive search to determine the minimal set of sequences required to capture the peptidome diversity within a given dataset. This compression is possible through the removal of identical sequences and unique sequences that do not contribute effectively to the peptidome diversity pool. Herein, we describe a detailed four-part protocol utilizing UNIQmin to generate the minimal set for the purpose of viral diversity analyses, alignment-free at any rank of the taxonomy lineage, using the recent global public health threat Monkeypox virus (MPX) sequence data as a case study. The protocol enables a systematic bioinformatics approach to study sequence diversity across taxonomic lineages, which is crucial for our future preparedness against viral epidemics. This is particularly important when data are abundant, freely available, and alignment is not an option. © 2024 Wiley Periodicals LLC.
- Basic Protocol 1: Tool installation and input file preparation
- Basic Protocol 2: Generation of a minimal set of sequences for a given dataset
- Basic Protocol 3: Comparative minimal set analysis across taxonomic lineage ranks
- Basic Protocol 4: Factors affecting the minimal set of sequences
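Illustrative note: the idea of a minimal set, that is, a reduced set of sequences that still contains every overlapping peptide (k-mer) found in the full dataset, can be sketched as follows. This is not UNIQmin's implementation: the abstract describes an exhaustive search, whereas the sketch uses a greedy set-cover heuristic swapped in for brevity, so it may return a set that is small but not guaranteed minimal. The function names and the peptide length `k` are assumptions.

```python
def kmers(seq: str, k: int) -> set:
    """All overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_minimal_set(seqs, k=9):
    """Pick sequences until every k-mer in the dataset is covered (greedy heuristic)."""
    unique = list(dict.fromkeys(seqs))           # drop identical sequences, keep order
    profiles = {s: kmers(s, k) for s in unique}  # k-mer content of each sequence
    uncovered = set().union(*profiles.values())  # k-mers still to be represented
    chosen = []
    while uncovered:
        # Take the sequence contributing the most not-yet-covered k-mers.
        best = max(unique, key=lambda s: len(profiles[s] & uncovered))
        chosen.append(best)
        uncovered -= profiles[best]
    return chosen


# Hypothetical toy sequences with k=3 to keep the example readable.
toy = ["ABCDE", "ABCDE", "BCD", "CDEFG"]
result = greedy_minimal_set(toy, k=3)
```

In this toy case the duplicate `ABCDE` and the sequence `BCD`, whose only 3-mer already occurs in `ABCDE`, add nothing to the pool of 3-mers and are left out of the selected set.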
Affiliation(s)
- Li Chuin Chong
  - Centre for Bioinformatics, School of Data Sciences, Perdana University, Kuala Lumpur, Malaysia
  - Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz, Turkey
  - Current affiliation: Institute for Experimental Virology, TWINCORE Centre for Experimental and Clinical Infection Research, a Medical School Hannover (MHH) and Helmholtz Centre for Infection Research (HZI) joint venture, Hannover, Germany
- Asif M Khan
  - Centre for Bioinformatics, School of Data Sciences, Perdana University, Kuala Lumpur, Malaysia
  - Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz, Turkey
  - Current affiliation: College of Computing and Information Technology, University of Doha for Science and Technology, Doha, Qatar

7
de Oliveira SG, Kotowski N, Sampaio-Filho HR, Aguiar FHB, Dávila AMR, Jardim R. Metalloproteinases in Restorative Dentistry: An In Silico Study toward an Ideal Animal Model. Biomedicines 2023; 11:3042. [PMID: 38002041 PMCID: PMC10669239 DOI: 10.3390/biomedicines11113042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2023] [Revised: 09/02/2023] [Accepted: 09/13/2023] [Indexed: 11/26/2023] Open
Abstract
In dentistry, various animal models are used to evaluate adhesive systems, dental caries and periodontal diseases. Metalloproteinases (MMPs) are enzymes that degrade collagen in the dentin matrix and are categorized in over 20 different classes. Collagenases and gelatinases are intrinsic constituents of the human dentin organic matrix fibrillar network and are the most abundant MMPs in this tissue. Understanding such enzymes' action on dentin is important in the development of approaches that could reduce dentin degradation and provide restorative procedures with extended longevity. This in silico study is based on dentistry's most used animal models and intends to search for the most suitable, evolutionarily close to Homo sapiens. We were able to retrieve 176,077 mammalian MMP sequences from the UniProt database. These sequences were manually curated through a three-step process. After curation, the remaining 3178 sequences were aligned in a multifasta file and phylogenetically reconstructed using the maximum likelihood method. Our study inferred that the animal models most evolutionarily related to Homo sapiens were Oryctolagus cuniculus (MMP-1, MMP-8 and MMP-9), Canis lupus (MMP-13) and Rattus norvegicus (MMP-2). Further research will be needed for the biological validation of our findings.
Affiliation(s)
- Simone Gomes de Oliveira
  - Piracicaba School of Dentistry, Campinas State University, Piracicaba 13414-903, SP, Brazil
  - School of Dentistry, State University of Rio de Janeiro, Rio de Janeiro 20551-030, RJ, Brazil
- Nelson Kotowski
  - Computational and Systems Biology Laboratory, Oswaldo Cruz Institute, Oswaldo Cruz Foundation, Rio de Janeiro 21040-900, RJ, Brazil
- Alberto Martín Rivera Dávila
  - Computational and Systems Biology Laboratory, Oswaldo Cruz Institute, Oswaldo Cruz Foundation, Rio de Janeiro 21040-900, RJ, Brazil
- Rodrigo Jardim
  - Computational and Systems Biology Laboratory, Oswaldo Cruz Institute, Oswaldo Cruz Foundation, Rio de Janeiro 21040-900, RJ, Brazil

8
Wei ZG, Chen X, Zhang XD, Zhang H, Fan XG, Gao HY, Liu F, Qian Y. Comparison of Methods for Biological Sequence Clustering. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:2874-2888. [PMID: 37028305 DOI: 10.1109/tcbb.2023.3253138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Recent advances in sequencing technology have considerably promoted genomics research by providing high-throughput sequencing economically. This great advancement has resulted in a huge amount of sequencing data. Clustering analysis is a powerful way to study and probe large-scale sequence data. A number of clustering methods have been developed in the last decade. Despite numerous comparison studies being published, we noticed that they have two main limitations: only traditional alignment-based clustering methods are compared, and the evaluation metrics rely heavily on labeled sequence data. In this study, we present a comprehensive benchmark study for sequence clustering methods. Specifically, i) alignment-based clustering algorithms including classical (e.g., CD-HIT, UCLUST, VSEARCH) and recently proposed methods (e.g., MMseqs2, Linclust, edClust) are assessed; ii) two alignment-free methods (LZW-Kernel and Mash) are included for comparison with alignment-based methods; and iii) different evaluation measures based on the true labels (supervised metrics) and the input data itself (unsupervised metrics) are applied to quantify their clustering results. The aims of this study are to help biological analysts choose a reasonable clustering algorithm for processing their collected sequences and, furthermore, to motivate algorithm designers to develop more efficient sequence clustering approaches.

9
Li Y, Ma B, Hua K, Gong H, He R, Luo R, Bi D, Zhou R, Langford PR, Jin H. PPNet: Identifying Functional Association Networks by Phylogenetic Profiling of Prokaryotic Genomes. Microbiol Spectr 2023; 11:e0387122. [PMID: 36602356 PMCID: PMC9927313 DOI: 10.1128/spectrum.03871-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Accepted: 12/01/2022] [Indexed: 01/06/2023] Open
Abstract
Identification of microbial functional association networks allows interpretation of biological phenomena and a greater understanding of the molecular basis of pathogenicity and also underpins the formulation of control measures. Here, we describe PPNet, a tool that uses genome information and analysis of phylogenetic profiles with binary similarity and distance measures to derive large-scale bacterial gene association networks of a single species. As an exemplar, we have derived a functional association network in the pig pathogen Streptococcus suis using 81 binary similarity and dissimilarity measures, which demonstrates excellent performance based on the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPR), and a derived overall scoring method. Selected network associations were validated experimentally by using bacterial two-hybrid experiments. We conclude that PPNet, which is publicly available (https://github.com/liyangjie/PPNet), can be used to construct microbial association networks from easily acquired genome-scale data. IMPORTANCE This study developed PPNet, the first tool that can be used to infer large-scale bacterial functional association networks of a single species. PPNet includes a method for assigning the uniqueness of a bacterial strain using the average nucleotide identity and the average nucleotide coverage. PPNet collected 81 binary similarity and distance measures for phylogenetic profiling and then evaluated and divided them into four groups. PPNet can effectively capture gene networks that are functionally related to phenotype from publicly available prokaryotic genomes, as well as provide valuable results for downstream analysis and experimental testing.
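Illustrative note: phylogenetic profiling compares genes by their presence/absence patterns across genomes, and a binary similarity measure turns each pair of patterns into an association score. The sketch below is not PPNet code. It uses the Jaccard index purely as one familiar example of a binary similarity measure; the abstract does not list which 81 measures are used, and the gene names, profiles, and threshold are hypothetical.

```python
def jaccard(a, b) -> float:
    """Jaccard index of two equal-length 0/1 presence/absence profiles."""
    both = sum(1 for x, y in zip(a, b) if x and y)     # present in both genomes
    either = sum(1 for x, y in zip(a, b) if x or y)    # present in at least one
    return both / either if either else 0.0


def association_edges(profiles, threshold):
    """Return (gene, gene, score) for every pair scoring at or above threshold."""
    genes = sorted(profiles)
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            score = jaccard(profiles[g], profiles[h])
            if score >= threshold:
                edges.append((g, h, score))
    return edges


# Hypothetical profiles: one column per genome, 1 = gene present.
profiles = {
    "geneA": [1, 1, 0, 1, 0],
    "geneB": [1, 1, 0, 1, 1],
    "geneC": [0, 0, 1, 0, 1],
}
edges = association_edges(profiles, threshold=0.5)
```

Here `geneA` and `geneB` co-occur in three of the four genomes where either is present, so they would be linked in the resulting network, while `geneC` would remain unconnected at this threshold.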
Affiliation(s)
- Yangjie Li
  - State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan, China
  - College of Animal Medicine, Huazhong Agricultural University, Wuhan, China
  - Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, Wuhan, China
  - College of Informatics, Huazhong Agricultural University, Wuhan, China
- Bin Ma
  - State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan, China
  - College of Animal Medicine, Huazhong Agricultural University, Wuhan, China
  - Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, Wuhan, China
- Kexin Hua
  - State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan, China
  - College of Animal Medicine, Huazhong Agricultural University, Wuhan, China
  - Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, Wuhan, China
- Huimin Gong
  - State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan, China
  - College of Animal Medicine, Huazhong Agricultural University, Wuhan, China
  - Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, Wuhan, China
- Rongrong He
  - State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan, China
  - College of Animal Medicine, Huazhong Agricultural University, Wuhan, China
  - Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, Wuhan, China
- Rui Luo
  - State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan, China
  - College of Animal Medicine, Huazhong Agricultural University, Wuhan, China
  - Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, Wuhan, China
- Dingren Bi
  - State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan, China
  - College of Animal Medicine, Huazhong Agricultural University, Wuhan, China
  - Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, Wuhan, China
- Rui Zhou
  - State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan, China
  - College of Animal Medicine, Huazhong Agricultural University, Wuhan, China
  - Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, Wuhan, China
- Paul R. Langford
  - Section of Paediatric Infectious Disease, Imperial College London, St Mary’s Campus, London, United Kingdom
- Hui Jin
  - State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan, China
  - College of Animal Medicine, Huazhong Agricultural University, Wuhan, China
  - Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, Wuhan, China

10
DeSantis TZ, Cardona C, Narayan NR, Viswanatham S, Ravichandar D, Wee B, Chow CE, Iwai S. StrainSelect: A novel microbiome reference database that disambiguates all bacterial strains, genome assemblies and extant cultures worldwide. Heliyon 2023; 9:e13314. [PMID: 36814618 PMCID: PMC9939595 DOI: 10.1016/j.heliyon.2023.e13314] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2022] [Revised: 12/01/2022] [Accepted: 01/26/2023] [Indexed: 02/05/2023] Open
Abstract
Motivation: Microbial metagenomic profiling software and databases are advancing rapidly for development of novel disease biomarkers and therapeutics, yet three problems impede analyses: 1) the conflation of "genome assembly" and "strain" in reference databases; 2) difficulty connecting DNA biomarkers to a procurable strain for laboratory experimentation; and 3) absence of a comprehensive and unified strain-resolved reference database for integrating both shotgun metagenomics and 16S rRNA gene data. Results: We demarcated 681,087 strains, the largest collection of its kind, by filtering public data into a knowledge graph of vertices representing contiguous DNA sequences, genome assemblies, strain monikers and bio-resource center (BRC) catalog numbers, then adding inter-vertex edges only for synonyms or direct derivatives. Surprisingly, for 10,043 important strains, we found replicate RefSeq genome assemblies obstructing interpretation of database searches. We organized each strain into eight taxonomic ranks with bootstrap confidence inversely correlated with genome assembly contamination. The StrainSelect database is suited for applications where a taxonomic, functional or procurement reference is needed for shotgun or amplicon metagenomics since 636,568 strains have at least one 16S rRNA gene, 245,005 have at least one annotated genome assembly, and 36,671 are procurable from at least one BRC. The database overcomes all three aforementioned problems since it disambiguates strains from assemblies, locates strains at BRCs, and unifies a taxonomic reference for both 16S rRNA and shotgun metagenomics. Availability: The StrainSelect database is available in igraph and tabular vertex-edge formats compatible with Neo4J. Dereplicated MinHash and fasta databases are distributed for sourmash and usearch pipelines at http://strainselect.secondgenome.com. Contact: todd.desantis@gmail.com. Supplementary information: Supplementary data are available online.
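Illustrative note: the abstract describes a knowledge graph whose vertices are sequences, assemblies, strain monikers and catalog numbers, with edges only for synonyms or direct derivatives. One plausible way to read off strains from such a graph is to treat each connected component as one strain. That reading is an assumption and is not stated in the abstract; the sketch below is not StrainSelect code, and all vertex labels are hypothetical.

```python
def find(parent, x):
    """Return the representative of x's group, shortening the path as it goes."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def group_records(vertices, edges):
    """Group vertices into connected components using union-find."""
    parent = {v: v for v in vertices}
    for a, b in edges:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a          # merge the two groups
    groups = {}
    for v in vertices:
        groups.setdefault(find(parent, v), []).append(v)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical records: two assemblies and one catalog number for strain X,
# one assembly for strain Y.
vertices = ["assembly:A1", "assembly:A2", "strain:X",
            "catalog:BRC-1", "strain:Y", "assembly:A3"]
edges = [("assembly:A1", "strain:X"), ("assembly:A2", "strain:X"),
         ("catalog:BRC-1", "strain:X"), ("assembly:A3", "strain:Y")]
groups = group_records(vertices, edges)
```

Under this assumption, the two replicate assemblies `A1` and `A2` fall into the same group as `strain:X`, which mirrors the distinction the abstract draws between a genome assembly and a strain.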
Collapse
Affiliation(s)
- Todd Z. DeSantis
- Second Genome, Inc., 1000 Marina Blvd, Suite 500, Brisbane, CA, 94005, USA
- Environmental Metagenomics, Research Center One Health Ruhr of the University Alliance Ruhr, Faculty of Chemistry, University of Duisburg-Essen, Germany
- Corresponding author at: Second Genome, Inc., 1000 Marina Blvd, Suite 500, Brisbane, CA, 94005, USA.
- Cesar Cardona
- Second Genome, Inc., 1000 Marina Blvd, Suite 500, Brisbane, CA, 94005, USA
- Nicole R. Narayan
- Second Genome, Inc., 1000 Marina Blvd, Suite 500, Brisbane, CA, 94005, USA
- Satish Viswanatham
- Second Genome, Inc., 1000 Marina Blvd, Suite 500, Brisbane, CA, 94005, USA
- Divya Ravichandar
- Second Genome, Inc., 1000 Marina Blvd, Suite 500, Brisbane, CA, 94005, USA
- Brendan Wee
- Second Genome, Inc., 1000 Marina Blvd, Suite 500, Brisbane, CA, 94005, USA
- Shoko Iwai
- Second Genome, Inc., 1000 Marina Blvd, Suite 500, Brisbane, CA, 94005, USA
11
Fullam A, Letunic I, Schmidt TSB, Ducarmon QR, Karcher N, Khedkar S, Kuhn M, Larralde M, Maistrenko O, Malfertheiner L, Milanese A, Rodrigues J, Sanchis-López C, Schudoma C, Szklarczyk D, Sunagawa S, Zeller G, Huerta-Cepas J, von Mering C, Bork P, Mende DR. proGenomes3: approaching one million accurately and consistently annotated high-quality prokaryotic genomes. Nucleic Acids Res 2023; 51:D760-D766. [PMID: 36408900 PMCID: PMC9825469 DOI: 10.1093/nar/gkac1078] [Citation(s) in RCA: 25] [Impact Index Per Article: 12.5] [Received: 09/15/2022] [Revised: 10/15/2022] [Accepted: 11/07/2022] [Indexed: 11/22/2022] Open
Abstract
The interpretation of genomic, transcriptomic and other microbial 'omics data is highly dependent on the availability of well-annotated genomes. As the number of publicly available microbial genomes continues to increase exponentially, the need for quality control and consistent annotation is becoming critical. We present proGenomes3, a database of 907 388 high-quality genomes containing 4 billion genes that passed stringent criteria and have been consistently annotated using multiple functional and taxonomic databases including mobile genetic elements and biosynthetic gene clusters. proGenomes3 encompasses 41 171 species-level clusters, defined based on universal single copy marker genes, for which pan-genomes and contextual habitat annotations are provided. The database is available at http://progenomes.embl.de/.
Affiliation(s)
- Anthony Fullam
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Ivica Letunic
- Biobyte solutions GmbH, Bothestr. 142, 69117 Heidelberg, Germany
- Thomas S B Schmidt
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Quinten R Ducarmon
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Nicolai Karcher
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Supriya Khedkar
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Michael Kuhn
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Martin Larralde
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Oleksandr M Maistrenko
- Royal Netherlands Institute for Sea Research (NIOZ), Department of Marine Microbiology & Biogeochemistry, 1797 SZ, ’t Horntje (Texel), Netherlands
- Lukas Malfertheiner
- Department of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, 8057 Zurich, Switzerland
- Alessio Milanese
- Institute of Microbiology, Department of Biology and Swiss Institute of Bioinformatics, ETH Zurich, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland
- Claudia Sanchis-López
- Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA-CSIC), Campus de Montegancedo-UPM, 28223, Pozuelo de Alarcón, Madrid, Spain
- Christian Schudoma
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Damian Szklarczyk
- Department of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, 8057 Zurich, Switzerland
- Shinichi Sunagawa
- Institute of Microbiology, Department of Biology and Swiss Institute of Bioinformatics, ETH Zurich, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland
- Georg Zeller
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Jaime Huerta-Cepas
- Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA-CSIC), Campus de Montegancedo-UPM, 28223, Pozuelo de Alarcón, Madrid, Spain
- Christian von Mering
- Department of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, 8057 Zurich, Switzerland
- Peer Bork
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Max Delbrück Centre for Molecular Medicine, 13125 Berlin, Germany
- Department of Bioinformatics, Biocenter, University of Würzburg, 97074 Würzburg, Germany
- Yonsei Frontier Lab (YFL), Yonsei University, 03722 Seoul, South Korea
- Daniel R Mende
- Department of Medical Microbiology, Amsterdam University Medical Centers, Amsterdam, The Netherlands
12
Goudey B, Geard N, Verspoor K, Zobel J. Propagation, detection and correction of errors using the sequence database network. Brief Bioinform 2022; 23:6764545. [PMID: 36266246 PMCID: PMC9677457 DOI: 10.1093/bib/bbac416] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/10/2022] [Revised: 07/31/2022] [Accepted: 08/28/2022] [Indexed: 12/14/2022] Open
Abstract
Nucleotide and protein sequences stored in public databases are the cornerstone of many bioinformatics analyses. The records containing these sequences are prone to a wide range of errors, including incorrect functional annotation, sequence contamination and taxonomic misclassification. One source of information that can help to detect errors is the strong interdependency between records. Novel sequences in one database draw their annotations from existing records, may generate new records in multiple other locations and will have varying degrees of similarity with existing records across a range of attributes. A network perspective of these relationships between sequence records, within and across databases, offers new opportunities to detect, or even correct, erroneous entries and, more broadly, to make inferences about record quality. Here, we describe this novel perspective of sequence database records as a rich network, which we call the sequence database network, and illustrate the opportunities this perspective offers for quantification of database quality and detection of spurious entries. We provide an overview of the relevant databases and describe how the interdependencies between sequence records across these databases can be exploited by network analyses. We review the process of sequence annotation and provide a classification of sources of error, highlighting propagation as a major source. We illustrate the value of a network perspective through three case studies that use network analysis to detect errors, and explore the quality and quantity of critical relationships that would inform such network analyses. This systematic description of a network perspective of sequence database records provides a novel direction to combat the proliferation of errors within these critical bioinformatics resources.
Affiliation(s)
- Benjamin Goudey
- Corresponding author. Benjamin Goudey, School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010
- Nicholas Geard
- School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010
- Karin Verspoor
- School of Computing Technologies, RMIT University Melbourne, Victoria, 3000
- Justin Zobel
- School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010
13
Rodríguez-Pérez H, Ciuffreda L, Flores C. NanoRTax, a real-time pipeline for taxonomic and diversity analysis of nanopore 16S rRNA amplicon sequencing data. Comput Struct Biotechnol J 2022; 20:5350-5354. [PMID: 36212537 PMCID: PMC9522874 DOI: 10.1016/j.csbj.2022.09.024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2022] [Revised: 09/16/2022] [Accepted: 09/16/2022] [Indexed: 11/05/2022] Open
Abstract
Background: The study of microbial communities and their applications have been leveraged by advances in sequencing techniques and bioinformatics tools. The Oxford Nanopore Technologies long-read sequencing by nanopores provides a portable and cost-efficient platform for sequencing assays. While this opens the possibility of sequencing applications outside specialized environments and real-time analysis of data, complementing the existing efficient library preparation protocols with streamlined bioinformatic workflows is required. Results: Here we present NanoRTax, a Nextflow pipeline for nanopore 16S rRNA gene amplicon data that features state-of-the-art taxonomic classification tools and real-time capability. The pipeline is paired with a web-based visual interface to enable user-friendly inspections of the experiment in progress. NanoRTax workflow and a simulated real-time analysis were used to validate the prediction of adult Intensive Care Unit patient mortality based on full-length 16S rRNA sequencing data from respiratory microbiome samples. Conclusions: This constitutes a proof-of-concept simulation study of how real-time bioinformatic workflows could be used to shorten the turnaround times in critical care settings and provides an instrument for future research on early-response strategies for sepsis.
Affiliation(s)
- Héctor Rodríguez-Pérez
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria, Santa Cruz de Tenerife 38010, Spain
- Laura Ciuffreda
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria, Santa Cruz de Tenerife 38010, Spain
- Carlos Flores
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria, Santa Cruz de Tenerife 38010, Spain
- CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid 28029, Spain
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Granadilla, Santa Cruz de Tenerife, Spain
- Facultad de Ciencias de la Salud, Universidad Fernando de Pessoa Canarias, 35450 Las Palmas de Gran Canaria, Spain
14
Hassan Zada MS, Yuan B, Khan WA, Anjum A, Reiff-Marganiec S, Saleem R. A unified graph model based on molecular data binning for disease subtyping. J Biomed Inform 2022; 134:104187. [PMID: 36055637 DOI: 10.1016/j.jbi.2022.104187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2022] [Revised: 08/05/2022] [Accepted: 08/25/2022] [Indexed: 11/19/2022]
Abstract
Molecular disease subtype discovery from omics data is an important research problem in precision medicine. The biggest challenges are the skewed distribution and data variability in the measurements of omics data. These challenges complicate the efficient identification of molecular disease subtypes defined by clinical differences, such as survival. Existing approaches adopt kernels to construct patient similarity graphs from each view through pairwise matching. However, the distance functions used in kernels are unable to utilize the potentially critical information of extreme values and data variability, which leads to a lack of robustness. In this paper, a novel robust distance metric (ROMDEX) is proposed to construct similarity graphs for molecular disease subtypes from omics data, which is able to address the data variability and extreme values challenges. The proposed approach is validated on multiple TCGA cancer datasets, and the results are compared with multiple baseline disease subtyping methods. The evaluation of results is based on Kaplan-Meier survival time analysis, which is validated using statistical tests, e.g., the Cox proportional hazards test (Cox p-value). We reject the null hypothesis that the cohorts have the same hazard for P-values less than 0.05. The proposed approach achieved best P-values of 0.00181, 0.00171, and 0.00758 for Gene Expression, DNA Methylation, and MicroRNA data respectively, which shows a significant difference in survival between the cohorts. In the results, the proposed approach outperformed the existing state-of-the-art (MRGC, PINS, SNF, Consensus Clustering and Icluster+) disease subtyping approaches on various individual disease views of multiple TCGA datasets.
Affiliation(s)
- Bo Yuan
- School of Computing and Mathematical Sciences, University of Leicester, United Kingdom.
- Wajahat Ali Khan
- School of Computing and Engineering, University of Derby, United Kingdom.
- Ashiq Anjum
- School of Computing and Mathematical Sciences, University of Leicester, United Kingdom.
- Rabia Saleem
- School of Computing and Engineering, University of Derby, United Kingdom.
15
Liu J, Capurro D, Nguyen A, Verspoor K. "Note Bloat" impacts deep learning-based NLP models for clinical prediction tasks. J Biomed Inform 2022; 133:104149. [PMID: 35878821 DOI: 10.1016/j.jbi.2022.104149] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/22/2022] [Revised: 05/28/2022] [Accepted: 07/19/2022] [Indexed: 10/17/2022]
Abstract
One unintended consequence of the Electronic Health Records (EHR) implementation is the overuse of content-importing technology, such as copy-and-paste, that creates "bloated" notes containing large amounts of textual redundancy. Despite the rising interest in applying machine learning models to learn from real-patient data, it is unclear how the phenomenon of note bloat might affect the Natural Language Processing (NLP) models derived from these notes. Therefore, in this work we examine the impact of redundancy on deep learning-based NLP models, considering four clinical prediction tasks using a publicly available EHR database. We applied two deduplication methods to the hospital notes, identifying large quantities of redundancy, and found that removing the redundancy usually has little negative impact on downstream performances, and can in certain circumstances assist models to achieve significantly better results. We also showed it is possible to attack model predictions by simply adding note duplicates, causing changes of correct predictions made by trained models into wrong predictions. In conclusion, we demonstrated that EHR text redundancy substantively affects NLP models for clinical prediction tasks, showing that the awareness of clinical contexts and robust modeling methods are important to create effective and reliable NLP systems in healthcare contexts.
Affiliation(s)
- Jinghui Liu
- School of Computing and Information Systems, The University of Melbourne, Victoria, Australia; Australian e-Health Research Centre, CSIRO, Brisbane, Australia.
- Daniel Capurro
- School of Computing and Information Systems, The University of Melbourne, Victoria, Australia; Centre for Digital Transformation of Health, Melbourne Medical School, The University of Melbourne, Victoria, Australia.
- Anthony Nguyen
- Australian e-Health Research Centre, CSIRO, Brisbane, Australia.
- Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Victoria, Australia; Centre for Digital Transformation of Health, Melbourne Medical School, The University of Melbourne, Victoria, Australia; School of Computing Technologies, RMIT University, Victoria, Australia.
16
Chen J, Goudey B, Zobel J, Geard N, Verspoor K. Exploring automatic inconsistency detection for literature-based gene ontology annotation. Bioinformatics 2022; 38:i273-i281. [PMID: 35758780 PMCID: PMC9235499 DOI: 10.1093/bioinformatics/btac230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/08/2022] [Indexed: 11/12/2022] Open
Abstract
Motivation: Literature-based gene ontology annotations (GOA) are biological database records that use controlled vocabulary to uniformly represent gene function information that is described in the primary literature. Assurance of the quality of GOA is crucial for supporting biological research. However, a range of different kinds of inconsistencies between the literature used as evidence and the annotated GO terms can be identified; these have not been systematically studied at record level. The existing manual-curation approach to GOA consistency assurance is inefficient and is unable to keep pace with the rate of updates to gene function knowledge. Automatic tools are therefore needed to assist with GOA consistency assurance. This article presents an exploration of different GOA inconsistencies and an early feasibility study of automatic inconsistency detection. Results: We have created a reliable synthetic dataset to simulate four realistic types of GOA inconsistency in biological databases. Three automatic approaches are proposed. They provide reasonable performance on the task of distinguishing the four types of inconsistency and are directly applicable to detect inconsistencies in real-world GOA database records. Major challenges resulting from such inconsistencies in the context of several specific application settings are reported. This is the first study to introduce automatic approaches that are designed to address the challenges in current GOA quality assurance workflows. The data underlying this article are available on GitHub at https://github.com/jiyuc/AutoGOAConsistency.
Affiliation(s)
- Jiyu Chen
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Benjamin Goudey
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Justin Zobel
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Nicholas Geard
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- School of Computer Technologies, RMIT University, Melbourne, VIC 3000, Australia
17
Liao X, Hu K, Salhi A, Zou Y, Wang J, Gao X. msRepDB: a comprehensive repetitive sequence database of over 80 000 species. Nucleic Acids Res 2021; 50:D236-D245. [PMID: 34850956 PMCID: PMC8728181 DOI: 10.1093/nar/gkab1089] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 08/14/2021] [Revised: 10/18/2021] [Accepted: 11/30/2021] [Indexed: 11/13/2022] Open
Abstract
Repeats are prevalent in the genomes of all bacteria, plants and animals, and they cover nearly half of the human genome. They play indispensable roles in evolution, inheritance, variation and genomic instability, and serve as substrates for chromosomal rearrangements that include disease-causing deletions, inversions, and translocations. Comprehensive identification, classification and annotation of repeats in genomes can provide accurate and targeted solutions towards understanding and diagnosis of complex diseases, optimization of plant properties and development of new drugs. RepBase and Dfam are the two most frequently used repeat databases, but they are not sufficiently complete. Due to the lack of a comprehensive repeat database of multiple species, the current research in this field is far from satisfactory. LongRepMarker is a new framework developed recently by our group for comprehensive identification of genomic repeats. We here propose msRepDB based on LongRepMarker, which is currently the most comprehensive multi-species repeat database, covering >80 000 species. Comprehensive evaluations show that msRepDB contains more species, and more complete repeats and families, than the RepBase and Dfam databases. msRepDB is available at https://msrepdb.cbrc.kaust.edu.sa/pages/msRepDB/index.html.
Affiliation(s)
- Xingyu Liao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, P.R. China
- Kang Hu
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, P.R. China
- Adil Salhi
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
- You Zou
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, P.R. China
- Jianxin Wang
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, P.R. China
- Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
18
Filho JAF, Rosolen RR, Almeida DA, de Azevedo PHC, Motta MLL, Aono AH, dos Santos CA, Horta MAC, de Souza AP. Trends in biological data integration for the selection of enzymes and transcription factors related to cellulose and hemicellulose degradation in fungi. 3 Biotech 2021; 11:475. [PMID: 34777932 PMCID: PMC8548487 DOI: 10.1007/s13205-021-03032-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/11/2021] [Accepted: 10/15/2021] [Indexed: 12/13/2022] Open
Abstract
Fungi are key players in biotechnological applications. Although several studies focusing on fungal diversity and genetics have been performed, many details of fungal biology remain unknown, including how cellulolytic enzymes are modulated within these organisms to allow changes in main plant cell wall compounds, cellulose and hemicellulose, and subsequent biomass conversion. With the advent and consolidation of DNA/RNA sequencing technology, different types of information can be generated at the genomic, structural and functional levels, including the gene expression profiles and regulatory mechanisms of these organisms, during degradation-induced conditions. This increase in data generation made rapid computational development necessary to deal with the large amounts of data generated. In this context, the origination of bioinformatics, a hybrid science integrating biological data with various techniques for information storage, distribution and analysis, was a fundamental step toward the current state-of-the-art in the postgenomic era. The possibility of integrating biological big data has facilitated exciting discoveries, including identifying novel mechanisms and more efficient enzymes, increasing yields, reducing costs and expanding opportunities in the bioprocess field. In this review, we summarize the current status and trends of the integration of different types of biological data through bioinformatics approaches for biological data analysis and enzyme selection.
Affiliation(s)
- Jaire A. Ferreira Filho
- Center for Molecular Biology and Genetic Engineering (CBMEG), University of Campinas (UNICAMP), Campinas, SP Brazil
- Rafaela R. Rosolen
- Center for Molecular Biology and Genetic Engineering (CBMEG), University of Campinas (UNICAMP), Campinas, SP Brazil
- Deborah A. Almeida
- Center for Molecular Biology and Genetic Engineering (CBMEG), University of Campinas (UNICAMP), Campinas, SP Brazil
- Paulo Henrique C. de Azevedo
- Center for Molecular Biology and Genetic Engineering (CBMEG), University of Campinas (UNICAMP), Campinas, SP Brazil
- Maria Lorenza L. Motta
- Center for Molecular Biology and Genetic Engineering (CBMEG), University of Campinas (UNICAMP), Campinas, SP Brazil
- Alexandre H. Aono
- Center for Molecular Biology and Genetic Engineering (CBMEG), University of Campinas (UNICAMP), Campinas, SP Brazil
- Clelton A. dos Santos
- Center for Molecular Biology and Genetic Engineering (CBMEG), University of Campinas (UNICAMP), Campinas, SP Brazil
- Brazilian Biorenewables National Laboratory (LNBR), Brazilian Center for Research in Energy and Materials (CNPEM), Campinas, SP Brazil
- Maria Augusta C. Horta
- Center for Molecular Biology and Genetic Engineering (CBMEG), University of Campinas (UNICAMP), Campinas, SP Brazil
- Faculty of Pharmaceutical Sciences of Ribeirão Preto, University of São Paulo, Ribeirão Preto, SP Brazil
- Anete P. de Souza
- Center for Molecular Biology and Genetic Engineering (CBMEG), University of Campinas (UNICAMP), Campinas, SP Brazil
- Department of Plant Biology, Institute of Biology, UNICAMP, Universidade Estadual de Campinas, Campinas, SP 13083-875 Brazil
19
James SA, Ong HS, Hari R, Khan AM. A systematic bioinformatics approach for large-scale identification and characterization of host-pathogen shared sequences. BMC Genomics 2021; 22:700. [PMID: 34583643 PMCID: PMC8477458 DOI: 10.1186/s12864-021-07657-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2021] [Accepted: 04/28/2021] [Indexed: 11/10/2022] Open
Abstract
Background: Biology has entered the era of big data with the advent of high-throughput omics technologies. Biological databases provide public access to petabytes of data and information facilitating knowledge discovery. Over the years, sequence data of pathogens has seen a large increase in the number of records, given the relatively small genome size and their important role as infectious and symbiotic agents. Humans are host to numerous pathogenic diseases, such as those caused by viruses, many of which are responsible for high mortality and morbidity. The interaction between pathogens and humans over evolutionary history has resulted in the sharing of sequences, with important biological and evolutionary implications. Results: This study describes a large-scale, systematic bioinformatics approach for identification and characterization of shared sequences between the host and pathogen. An application of the approach is demonstrated through identification and characterization of the Flaviviridae-human share-ome. A total of 2430 nonamers represented the Flaviviridae-human share-ome with 100% identity. Although the share-ome represented a small fraction of the repertoire of Flaviviridae (~0.12%) and human (~0.013%) non-redundant nonamers, the 2430 shared nonamers mapped to 16,946 Flaviviridae and 7506 human non-redundant protein sequences. The shared nonamer sequences mapped to 125 species of Flaviviridae, including several with unclassified genus. The majority (~68%) of the shared sequences mapped to Hepacivirus C species; West Nile, dengue and Zika viruses of the Flavivirus genus accounted for ~11%, ~7%, and ~3%, respectively, of the Flaviviridae protein sequences (16,946) mapped by the share-ome. Further characterization of the share-ome provided important structural-functional insights into Flaviviridae-human interactions.
Conclusion: Mapping of the host-pathogen share-ome has important implications for the design of vaccines and drugs, diagnostics, disease surveillance and the discovery of unknown, potential host-pathogen interactions. The generic workflow presented herein is potentially applicable to a variety of pathogens, such as those of viral, bacterial or parasitic origin. Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-021-07657-4.
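The core idea of a nonamer share-ome, as described in the abstract above, can be sketched as a set intersection of overlapping 9-mers. This is only a minimal illustration under stated assumptions, not the authors' workflow: the function names (`kmers`, `nonredundant_kmers`, `share_ome`, `map_back`) and the toy sequences are hypothetical, and matching is by exact string identity (100% identity).

```python
def kmers(seq, k=9):
    """Return the set of all overlapping k-mers in one protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def nonredundant_kmers(proteins, k=9):
    """Union of k-mers over a collection of sequences (duplicates collapse)."""
    pool = set()
    for seq in proteins:
        pool |= kmers(seq.upper(), k)
    return pool


def share_ome(host, pathogen, k=9):
    """k-mers present, with exact identity, in both host and pathogen proteins."""
    return nonredundant_kmers(host.values(), k) & nonredundant_kmers(pathogen.values(), k)


def map_back(shared, proteins, k=9):
    """IDs of the proteins that contain at least one shared k-mer."""
    return {pid for pid, seq in proteins.items() if kmers(seq.upper(), k) & shared}


# Toy data (hypothetical sequences, for illustration only).
host = {"H1": "MKTAYIAKQRQISFVK"}
pathogen = {"P1": "XXTAYIAKQRQYY", "P2": "GGGGGGGGGGGG"}

shared = share_ome(host, pathogen)
print(shared)                      # the shared nonamers
print(map_back(shared, pathogen))  # pathogen proteins carrying them
```

Using sets keeps the comparison non-redundant by construction; at the scale reported in the abstract, a real analysis would likely need a more memory-efficient index than in-memory Python sets.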
Affiliation(s)
- Stephen Among James
- Centre for Bioinformatics, School of Data Sciences, Perdana University, Damansara Heights, Kuala Lumpur, 50490, Malaysia
- Department of Biochemistry, Faculty of Science, Kaduna State University, Kaduna, 800211, Nigeria
- Hui San Ong
- Centre for Bioinformatics, School of Data Sciences, Perdana University, Damansara Heights, Kuala Lumpur, 50490, Malaysia
- Ranjeev Hari
- Centre for Bioinformatics, School of Data Sciences, Perdana University, Damansara Heights, Kuala Lumpur, 50490, Malaysia
- Asif M Khan
- Centre for Bioinformatics, School of Data Sciences, Perdana University, Damansara Heights, Kuala Lumpur, 50490, Malaysia
- Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz, Istanbul, 34820, Turkey
20
Giusti A, Ricci E, Gasperetti L, Galgani M, Polidori L, Verdigi F, Narducci R, Armani A. Building of an Internal Transcribed Spacer (ITS) Gene Dataset to Support the Italian Health Service in Mushroom Identification. Foods 2021; 10:1193. [PMID: 34070525 PMCID: PMC8227961 DOI: 10.3390/foods10061193] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/12/2021] [Revised: 05/23/2021] [Accepted: 05/24/2021] [Indexed: 01/26/2023] Open
Abstract
This study aims to build an ITS gene dataset to support the Italian Health Service in mushroom identification. The target species were selected from among those most often involved in regional (Tuscany) poisoning cases. For each target species, all the ITS sequences already deposited in the GenBank and BOLD databases were retrieved and accurately assessed for quality and reliability by a systematic filtering process. Wild specimens of target species were also collected to produce reference ITS sequences. These were used partly to set up and partly to validate the dataset by BLAST analysis. Overall, 7270 sequences were found in the two databases. After filtering, 1293 sequences (17.8%) were discarded, with a final retrieval of 5977 sequences. Ninety-seven ITS reference sequences were obtained from 76 collected mushroom specimens: 15 of them, obtained from 10 species with no sequences available after the filtering, were used to build the dataset, with a final taxonomic coverage of 96.7%. The other 82 sequences (66 species) were used for the dataset validation. In most cases (n = 71; 86.6%) they matched with identity values ≥ 97–100% with the corresponding species. The dataset was able to identify the species involved in regional poisoning incidents. As some of these species are also involved in poisonings at the national level, the dataset may be used for supporting the National Health Service throughout the Italian territory. Moreover, it can support the official control activities aimed at detecting fraud in commercial mushroom-based products and safeguarding consumers.
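The counts reported in this abstract are internally consistent, which the short check below illustrates. It is only an arithmetic sketch using the figures quoted above; the variable names are illustrative and not taken from the study.

```python
# Figures quoted in the abstract.
total_sequences = 7270      # sequences found in GenBank and BOLD
discarded = 1293            # sequences removed by filtering
reference_sequences = 97    # reference ITS sequences produced from specimens
used_to_build = 15          # reference sequences added to the dataset
used_to_validate = 82       # reference sequences used for validation
matched = 71                # validation sequences matching the expected species

retained = total_sequences - discarded
discard_pct = round(100 * discarded / total_sequences, 1)
match_pct = round(100 * matched / used_to_validate, 1)

print(retained)     # sequences kept after filtering
print(discard_pct)  # percentage discarded
print(match_pct)    # percentage of validation sequences matched
```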
Affiliation(s)
- Alice Giusti
- FishLab, Department of Veterinary Sciences, University of Pisa, Viale delle Piagge 2, 56124 Pisa, Italy; (M.G.); (A.A.)
- Correspondence: ; Tel.: +39-0502210204
- Enrica Ricci
- Experimental Zooprophylactic Institute of Lazio and Tuscany M. Aleandri, UOT Toscana Nord, SS Abetone e Brennero 4, 56124 Pisa, Italy; (E.R.); (L.G.)
- Laura Gasperetti
- Experimental Zooprophylactic Institute of Lazio and Tuscany M. Aleandri, UOT Toscana Nord, SS Abetone e Brennero 4, 56124 Pisa, Italy; (E.R.); (L.G.)
- Marta Galgani
- FishLab, Department of Veterinary Sciences, University of Pisa, Viale delle Piagge 2, 56124 Pisa, Italy; (M.G.); (A.A.)
- Luca Polidori
- Tuscany Mycological Groups Association, via Turi, 8 Santa Croce sull’Arno, 56124 Pisa, Italy; (L.P.); (R.N.)
- Francesco Verdigi
- North West Tuscany LHA (Mycological Inspectorate), via A. Cocchi, 7/9, 56124 Pisa, Italy;
- Roberto Narducci
- Tuscany Mycological Groups Association, via Turi, 8 Santa Croce sull’Arno, 56124 Pisa, Italy; (L.P.); (R.N.)
- Andrea Armani
- FishLab, Department of Veterinary Sciences, University of Pisa, Viale delle Piagge 2, 56124 Pisa, Italy; (M.G.); (A.A.)
|
21
|
Ciuffreda L, Rodríguez-Pérez H, Flores C. Nanopore sequencing and its application to the study of microbial communities. Comput Struct Biotechnol J 2021; 19:1497-1511. [PMID: 33815688 PMCID: PMC7985215 DOI: 10.1016/j.csbj.2021.02.020] [Citation(s) in RCA: 105] [Impact Index Per Article: 26.3] [Received: 11/18/2020] [Revised: 02/24/2021] [Accepted: 02/27/2021] [Indexed: 12/14/2022]
Abstract
Since its introduction, nanopore sequencing has enhanced our ability to study complex microbial samples by making it possible to sequence long reads in real time using inexpensive and portable technologies. The use of long reads has made it possible to address several previously unsolved issues in the field, such as the resolution of complex genomic structures, and has facilitated access to metagenome-assembled genomes (MAGs). Furthermore, the low cost and portability of the platforms, together with the development of rapid protocols and analysis pipelines, have made nanopore technology an attractive and ever-growing tool for real-time in-field sequencing for environmental microbial analysis. This review provides an up-to-date summary of the experimental protocols and bioinformatic tools for the study of microbial communities using nanopore sequencing, highlighting the most important and recent research in the field with a major focus on infectious diseases. An overview of the main approaches, including targeted and shotgun approaches, metatranscriptomics, epigenomics, and epitranscriptomics, is provided, together with an outlook on the major challenges and perspectives for the use of this technology in microbial studies.
Affiliation(s)
- Laura Ciuffreda
- Research Unit, Hospital Universitario N.S. de Candelaria, Universidad de La Laguna, 38010 Santa Cruz de Tenerife, Spain
- Héctor Rodríguez-Pérez
- Research Unit, Hospital Universitario N.S. de Candelaria, Universidad de La Laguna, 38010 Santa Cruz de Tenerife, Spain
- Carlos Flores
- Research Unit, Hospital Universitario N.S. de Candelaria, Universidad de La Laguna, 38010 Santa Cruz de Tenerife, Spain
- CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, 28029 Madrid, Spain
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Instituto de Tecnologías Biomédicas (ITB), Universidad de La Laguna, 38200 Santa Cruz de Tenerife, Spain
|
22
|
Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases. Genomics Proteomics & Bioinformatics 2020; 18:91-103. [PMID: 32652120 PMCID: PMC7646089 DOI: 10.1016/j.gpb.2018.11.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 12/08/2017] [Revised: 10/24/2018] [Accepted: 12/14/2018] [Indexed: 11/27/2022]
|
23
|
Chen Q, Du J, Kim S, Wilbur WJ, Lu Z. Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records. BMC Med Inform Decis Mak 2020; 20:73. [PMID: 32349758 PMCID: PMC7191680 DOI: 10.1186/s12911-020-1044-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 12/15/2022]
Abstract
BACKGROUND Capturing sentence semantics plays a vital role in a range of text mining applications. Despite continuous efforts on the development of related datasets and models in the general domain, both datasets and models are limited in the biomedical and clinical domains. The BioCreative/OHNLP2018 organizers made the first attempt to annotate 1068 sentence pairs from clinical notes and called for a community effort to tackle the Semantic Textual Similarity (BioCreative/OHNLP STS) challenge. METHODS We developed models using traditional machine learning and deep learning approaches. For the post challenge, we focused on two models: the Random Forest and the Encoder Network. We applied sentence embeddings pre-trained on PubMed abstracts and MIMIC-III clinical notes and updated the Random Forest and the Encoder Network accordingly. RESULTS The official results demonstrated that our best submission was the ensemble of eight models. It achieved a Pearson correlation coefficient of 0.8328, the highest performance among 13 submissions from 4 teams. For the post challenge, the performance of both the Random Forest and the Encoder Network was improved; in particular, the correlation of the Encoder Network was improved by ~13%. During the challenge task, no end-to-end deep learning models performed better than machine learning models that take manually crafted features. In contrast, with the sentence embeddings pre-trained on biomedical corpora, the Encoder Network now achieves a correlation of ~0.84, which is higher than the original best model. The ensembled model taking the improved versions of the Random Forest and Encoder Network as inputs further increased performance to 0.8528. CONCLUSIONS Deep learning models with sentence embeddings pre-trained on biomedical corpora achieve the highest performance on the test set. Through error analysis, we find that end-to-end deep learning models and traditional machine learning models with manually crafted features complement each other by finding different types of sentences. We suggest that a combination of these models can better find similar sentences in practice.
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
- Jingcheng Du
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
- School of Biomedical Informatics, UTHealth, Houston, USA
- Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
- W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
- Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
|
24
|
Mende DR, Letunic I, Maistrenko OM, Schmidt TSB, Milanese A, Paoli L, Hernández-Plaza A, Orakov AN, Forslund SK, Sunagawa S, Zeller G, Huerta-Cepas J, Coelho LP, Bork P. proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Res 2020; 48:D621-D625. [PMID: 31647096 PMCID: PMC7145564 DOI: 10.1093/nar/gkz1002] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 09/15/2019] [Revised: 10/15/2019] [Accepted: 10/18/2019] [Indexed: 12/16/2022]
Abstract
Microbiology depends on the availability of annotated microbial genomes for many applications. Comparative genomics approaches have been a major advance, but consistent and accurate annotations of genomes can be hard to obtain. In addition, newer concepts such as the pan-genome concept are still being implemented to help answer biological questions. Hence, we present proGenomes2, which provides 87 920 high-quality genomes in a user-friendly and interactive manner. Genome sequences and annotations can be retrieved individually or by taxonomic clade. Every genome in the database has been assigned to a species cluster and most genomes could be accurately assigned to one or multiple habitats. In addition, general functional annotations and specific annotations of antibiotic resistance genes and single nucleotide variants are provided. In short, proGenomes2 provides threefold more genomes, enhanced habitat annotations, updated taxonomic and functional annotation and improved linkage to the NCBI BioSample database. The database is available at http://progenomes.embl.de/.
Affiliation(s)
- Daniel R Mende
- Department of Medical Microbiology, Academic Medical Centre, University of Amsterdam, Amsterdam, The Netherlands
- Ivica Letunic
- Biobyte solutions GmbH, Bothestr, 142, 69117 Heidelberg, Germany
- Oleksandr M Maistrenko
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Thomas S B Schmidt
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Alessio Milanese
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Lucas Paoli
- Institute of Microbiology, Department of Biology, ETH Zurich, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland
- Ana Hernández-Plaza
- Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223, Pozuelo de Alarcón, Madrid, Spain
- Askarbek N Orakov
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Sofia K Forslund
- Max Delbrück Centre for Molecular Medicine, 13125 Berlin, Germany
- Shinichi Sunagawa
- Institute of Microbiology, Department of Biology, ETH Zurich, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland
- Georg Zeller
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Jaime Huerta-Cepas
- Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223, Pozuelo de Alarcón, Madrid, Spain
- Luis Pedro Coelho
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China
- Peer Bork
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Max Delbrück Centre for Molecular Medicine, 13125 Berlin, Germany
- Molecular Medicine Partnership Unit, University of Heidelberg and European Molecular Biology Laboratory, 69120 Heidelberg, Germany
- Department of Bioinformatics, Biocenter, University of Würzburg, 97074 Würzburg, Germany
|
25
|
A pruning strategy to improve pairwise comparison-based near-duplicate detection. Knowl Inf Syst 2019. [DOI: 10.1007/s10115-018-1299-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
|
26
|
Mier P, Andrade-Navarro MA. Toward completion of the Earth's proteome: an update a decade later. Brief Bioinform 2019; 20:463-470. [PMID: 29040399 DOI: 10.1093/bib/bbx127] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 07/12/2017] [Revised: 09/08/2017] [Indexed: 12/13/2022]
Abstract
Protein databases are growing steadily, driven by the spread of new, more efficient sequencing techniques. This growth is dominated by an increase in redundancy (homologous proteins with various degrees of sequence similarity) and by the inability to process and curate sequence entries as fast as they are created. To understand these trends and aid bioinformatic resources that might be compromised by the increasing size of the protein sequence databases, we have created a less redundant protein data set. In parallel, we analyzed the evolution of protein sequence databases in terms of size and redundancy. While the SwissProt database has decelerated its growth, mostly because of a focus on increasing the level of annotation of its sequences, its counterpart TrEMBL, much less limited by curation steps, is still in a phase of accelerated growth. However, we predict that before 2020, almost all entries deposited in UniProtKB will be homologous to known proteins. We propose that new sequencing projects can be made more useful if they are directed toward sequencing voids, parts of the tree of life far from already sequenced species or model organisms. We show that these voids are present in the Archaea and Eukarya domains of life. The near certainty that new protein sequence entries will be redundant leads to the consideration that most of the protein diversity on Earth has already been described, which we estimate at around 3.75 million proteins, revising down the prediction we made a decade ago.
Affiliation(s)
- Pablo Mier
- Faculty of Biology, Johannes Gutenberg University Mainz, Gresemundweg, Mainz, Germany
|
27
|
Abstract
Android ransomware is one of the most threatening attacks nowadays. Ransomware in general encrypts or locks the files on the victim’s device and requests a payment in order to recover them. The available technologies are not sufficient, as new ransomware variants employ a combination of techniques to evade anti-virus detection. Moreover, only a few studies in the literature have proposed static and/or dynamic approaches to detect Android ransomware in particular. Additionally, although there are plenty of open-source malware datasets, the research community still lacks ransomware datasets. In this paper, the state of the art of Android ransomware detection approaches was investigated. A deep comparative analysis was conducted, which shed light on the key differences among the existing solutions. An application programming interface (API)-based ransomware detection system (API-RDS) was proposed to provide a static analysis paradigm for detecting Android ransomware apps. API-RDS focuses on examining API package calls as a leading indicator of ransomware activity, to discriminate ransomware with high accuracy before it harms the user’s device. API package calls of both benign and ransomware apps were thoroughly analyzed and compared. Significant API packages with corresponding methods were identified. The experimental results show that API-RDS outperformed other recent related approaches. API-RDS achieved 97% accuracy while reducing the complexity of the classification model by 26% due to feature reduction. Moreover, this research designed a proactive mechanism based on a high-quality, unique ransomware dataset without duplicated samples. A total of 2959 ransomware samples were collected, tested, and reduced by almost 83% due to sample duplication. This research also contributes an up-to-date, unique dataset that covers the majority of existing Android ransomware families and recent clean apps, which could be used as a labeled reference by the research community.
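The abstract above reports that most collected samples were duplicates but does not say how duplicates were identified. As a rough illustration only, one common approach is to compare cryptographic digests of the file contents. In the sketch below, the function name `deduplicate`, the sample names, and the choice of SHA-256 are assumptions, not details taken from the paper.

```python
import hashlib


def deduplicate(samples: dict[str, bytes]) -> dict[str, bytes]:
    """Keep the first sample seen for each distinct SHA-256 digest.

    Assumption: two samples are "duplicates" when their bytes are identical.
    The cited study may have used a different criterion.
    """
    seen: set[str] = set()
    unique: dict[str, bytes] = {}
    for name, content in samples.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[name] = content
    return unique


# Hypothetical toy collection: two of the three files have identical content.
samples = {
    "a.apk": b"payload-1",
    "b.apk": b"payload-1",
    "c.apk": b"payload-2",
}
unique = deduplicate(samples)

# Fraction of the collection removed as duplicates.
reduction = 1 - len(unique) / len(samples)
```

Exact-digest matching only catches byte-identical files; repackaged variants of the same app would need a looser similarity measure.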
|
28
|
Bouadjenek MR, Zobel J, Verspoor K. Automated assessment of biological database assertions using the scientific literature. BMC Bioinformatics 2019; 20:216. [PMID: 31035936 PMCID: PMC6489365 DOI: 10.1186/s12859-019-2801-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/14/2018] [Accepted: 04/09/2019] [Indexed: 12/27/2022]
Abstract
BACKGROUND The large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. We explore in this paper an automated method for assessing the consistency of biological assertions, to assist biocurators, which we call BARC, Biocuration tool for Assessment of Relation Consistency. In this method a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we then use our novel set-based relevance algorithm SaBRA to retrieve pertinent literature, and apply a classifier to estimate the likelihood that this relation (assertion) is correct. RESULTS Our experiments on assessing gene-disease relations and protein-protein interactions using the PubMed Central collection show that BARC can be effective at assisting curators to perform data cleansing. Specifically, the results obtained showed that BARC substantially outperforms the best baselines, with an improvement of F-measure of 3.5% and 13%, respectively, on gene-disease relations and protein-protein interactions. We have additionally carried out a feature analysis that showed that all feature types are informative, as are all fields of the documents. CONCLUSIONS BARC provides a clear benefit for the biocuration community, as there are no prior automated tools for identifying inconsistent assertions in large-scale biological databases.
Affiliation(s)
- Mohamed Reda Bouadjenek
- Department of Mechanical & Industrial Engineering, University of Toronto, Toronto, M5S 3G8 Canada
- Justin Zobel
- School of Computing and Information Systems, University of Melbourne, Melbourne, 3010 Australia
- Karin Verspoor
- School of Computing and Information Systems, University of Melbourne, Melbourne, 3010 Australia
|
29
|
Chen Q, Zhang X, Wan Y, Zobel J, Verspoor K. Search Effectiveness in Nonredundant Sequence Databases: Assessments and Solutions. J Comput Biol 2018; 26:605-617. [PMID: 30585742 DOI: 10.1089/cmb.2018.0198] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/23/2022]
Abstract
Duplicate sequence records, that is, records having similar or identical sequences, are a challenge when searching biological sequence databases. They significantly increase database search time and can lead to uninformative search results containing similar sequences. Sequence clustering methods have been used to address this issue by grouping similar sequences into clusters. These clusters form a nonredundant database consisting of representatives (one record per cluster) and members (the remaining records in a cluster). In this approach, for nonredundant database search, users search against representatives first and optionally expand search results by exploring member records from matching clusters. Existing studies used Precision and Recall to assess the search effectiveness of nonredundant databases. However, the use of Precision and Recall does not model user behavior in practice and thus may not reflect practical search effectiveness. In this study, we first propose innovative evaluation metrics to measure search effectiveness. The findings are that (1) the Precision of expanded sets is consistently lower than that of representatives, with a decrease of up to 7% at top ranks; and (2) Recall is uninformative because, for most queries, expanded sets return more records than does search of the original unclustered databases. Motivated by these findings, we propose a solution that returns a user-specified proportion of top similar records, modeled by a ranking function that aggregates sequence and annotation similarities. In experiments undertaken on UniProtKB/Swiss-Prot, the largest expert-curated protein database, we show that our method dramatically reduces the number of returned sequences, increases Precision by 3%, and does not impact effective search time.
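The ranking idea in this abstract, scoring cluster members by a combination of sequence and annotation similarity and returning only a user-specified top proportion, can be sketched as follows. This is a minimal illustration, not the authors' function: the weighted sum, the weight `alpha`, the `Hit` type, and the toy scores are all assumptions.

```python
import math
from typing import NamedTuple


class Hit(NamedTuple):
    record_id: str
    seq_sim: float  # sequence similarity to the query, assumed in [0, 1]
    ann_sim: float  # annotation similarity to the query, assumed in [0, 1]


def aggregate_score(hit: Hit, alpha: float = 0.5) -> float:
    """Weighted sum of the two similarities (alpha is an assumed weight)."""
    return alpha * hit.seq_sim + (1.0 - alpha) * hit.ann_sim


def top_proportion(hits: list[Hit], proportion: float, alpha: float = 0.5) -> list[Hit]:
    """Return the top `proportion` of hits, ranked by the aggregated score."""
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")
    ranked = sorted(hits, key=lambda h: aggregate_score(h, alpha), reverse=True)
    keep = math.ceil(proportion * len(ranked))
    return ranked[:keep]


# Hypothetical member records of one matching cluster.
members = [
    Hit("A", 0.95, 0.90),
    Hit("B", 0.99, 0.20),
    Hit("C", 0.80, 0.85),
    Hit("D", 0.60, 0.10),
]
selected = top_proportion(members, proportion=0.5)
selected_ids = [h.record_id for h in selected]
```

With equal weights, record B drops out despite having the highest sequence similarity, because its annotations differ from the query; this is the kind of effect that combining the two signals is meant to produce.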
Affiliation(s)
- Qingyu Chen
- 1 School of Computing and Information Systems, The University of Melbourne, Parkville, Australia
- Xiuzhen Zhang
- 2 School of Science, RMIT University, Melbourne, Australia
- Yu Wan
- 3 Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Parkville, Australia
- Justin Zobel
- 1 School of Computing and Information Systems, The University of Melbourne, Parkville, Australia
- Karin Verspoor
- 1 School of Computing and Information Systems, The University of Melbourne, Parkville, Australia
|
30
|
Chen H, Zhang Z, Peng W. miRDDCR: a miRNA-based method to comprehensively infer drug-disease causal relationships. Sci Rep 2017; 7:15921. [PMID: 29162848 PMCID: PMC5698443 DOI: 10.1038/s41598-017-15716-8] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 09/25/2017] [Accepted: 10/31/2017] [Indexed: 01/10/2023]
Abstract
Revealing the cause-and-effect mechanism behind drug-disease relationships remains a challenging task. Recent studies suggested that drugs can target microRNAs (miRNAs) and alter their expression levels. Meanwhile, the inappropriate expression of miRNAs can lead to various diseases. Therefore, targeting specific miRNAs with small-molecule drugs to modulate their activities provides a promising approach to human disease treatment. However, few studies attempt to discover drug-disease causal relationships at the molecular level of miRNAs. Here, we developed a miRNA-based inference method, miRDDCR, to comprehensively predict drug-disease causal relationships. We first constructed a three-layer drug-miRNA-disease heterogeneous network by combining similarity measurements, existing drug-miRNA associations and miRNA-disease associations. Then, we extended the Random Walk algorithm to the three-layer heterogeneous network and ranked the potential indications for drugs. Leave-one-out cross-validations and case studies demonstrated that miRDDCR can achieve excellent prediction power. Compared with related methods, our causality discovery-based algorithm showed superior prediction ability and highlighted the molecular basis, miRNAs, which can be used to assist in the experimental design for drug development and disease treatment. Finally, the comprehensively inferred drug-disease causal relationships were released for further studies.
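For readers unfamiliar with random-walk ranking, the sketch below shows a plain random walk with restart on a small toy graph whose nodes stand for one drug, two miRNAs, and two diseases. It is not the miRDDCR algorithm: the paper's three-layer extension is not reproduced here, and the graph, the restart probability of 0.3, and all names are assumptions chosen for illustration.

```python
def random_walk_with_restart(adj, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walker that restarts at `seed`.

    adj     : square list-of-lists adjacency matrix (symmetric, non-negative)
    seed    : index of the start node (here, the drug)
    restart : probability of jumping back to the seed at each step (assumed value)
    """
    n = len(adj)
    # Column-normalise so each column with at least one edge sums to 1.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    w = [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    p0 = [1.0 if i == seed else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1.0 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        # Stop when the L1 change between iterations is negligible.
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy graph (hypothetical): 0 = drug, 1 and 2 = miRNAs, 3 and 4 = diseases.
# Edges: drug-miRNA1, miRNA1-miRNA2 (similarity), miRNA1-disease3, miRNA2-disease4.
adj = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
]
scores = random_walk_with_restart(adj, seed=0)

# Rank the disease nodes by their score for the seed drug.
ranked_diseases = sorted([3, 4], key=lambda d: scores[d], reverse=True)
```

In this toy graph, disease 3 is reached through the miRNA directly associated with the drug, so it ranks above disease 4, which is reachable only through a miRNA similarity edge.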
Affiliation(s)
- Hailin Chen
- School of Software, East China Jiaotong University, Nanchang, China
- Zuping Zhang
- School of Information Science and Engineering, Central South University, Changsha, China
- Wei Peng
- Computer Center of Kunming University of Science and Technology, Kunming, China
|
31
|
Bouadjenek MR, Verspoor K, Zobel J. Automated detection of records in biological sequence databases that are inconsistent with the literature. J Biomed Inform 2017. [PMID: 28624643 DOI: 10.1016/j.jbi.2017.06.015] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 01/25/2023]
Abstract
We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatic detection of data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). Then, we use these quality indicators to train an anomaly detection algorithm to classify records as "confident" or "suspicious". Our experiments on the PubMed Central collection show assessing the coherence between the literature and database records, through our algorithms, is an effective mechanism for assisting curators to perform data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we would expect that there are many more in GenBank that have not yet been identified. By automated comparison with literature they can be identified with a precision of up to 10% and a recall of up to 30%, while strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data, and the limited explicitly labelled data that is available. Overall, the obtained results show promise for the development of a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators in identifying inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.
Affiliation(s)
- Mohamed Reda Bouadjenek
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia
- Karin Verspoor
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia
- Justin Zobel
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia
|