1
Zhu J, Che C, Jiang H, Xu J, Yin J, Zhong Z. SSF-DDI: a deep learning method utilizing drug sequence and substructure features for drug-drug interaction prediction. BMC Bioinformatics 2024; 25:39. [PMID: 38262923 PMCID: PMC10810255 DOI: 10.1186/s12859-024-05654-4] [Received: 10/20/2023] [Accepted: 01/12/2024] [Indexed: 01/25/2024]
Abstract
BACKGROUND Drug-drug interactions (DDI) are prevalent in combination therapy, which makes identifying and predicting potential DDI important. While various artificial intelligence methods can predict and identify potential DDI, they often overlook the sequence information of drug molecules and fail to comprehensively consider the contribution of molecular substructures to DDI. RESULTS In this paper, we propose a novel model for DDI prediction based on sequence and substructure features (SSF-DDI) to address these issues. Our model integrates drug sequence features and structural features from the drug molecule graph, providing enhanced information for DDI prediction and enabling a more comprehensive and accurate representation of drug molecules. CONCLUSION Experiments and case studies demonstrate that SSF-DDI significantly outperforms state-of-the-art DDI prediction models across multiple real datasets and settings. SSF-DDI also performs better in predicting DDI involving unknown drugs, with a 5.67% improvement in accuracy over state-of-the-art methods.
Affiliation(s)
- Jing Zhu
  - Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116000, China
- Chao Che
  - School of Software Engineering, Dalian University, Dalian, 116000, China
- Hao Jiang
  - Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116000, China
- Jian Xu
  - General Surgery, Affiliated Zhongshan Hospital of Dalian University, Dalian, 116000, China
- Jiajun Yin
  - General Surgery, Affiliated Zhongshan Hospital of Dalian University, Dalian, 116000, China
- Zhaoqian Zhong
  - Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116000, China
2
Zhang X, Chen S, Zhao Z, Ma C, Liu Y. Investigation of B-atp6-orfH79 distributing in Chinese populations of Oryza rufipogon and analysis of its chimeric structure. BMC Plant Biol 2023; 23:81. [PMID: 36750954 PMCID: PMC9903446 DOI: 10.1186/s12870-023-04082-5] [Received: 10/24/2022] [Accepted: 01/23/2023] [Indexed: 06/18/2023]
Abstract
BACKGROUND The cytoplasmic male sterility (CMS) of rice is caused by chimeric mitochondrial DNA (mtDNA) that is maternally inherited in the majority of multicellular organisms. Wild rice (Oryza rufipogon Griff.) has been regarded as the ancestral progenitor of Asian cultivated rice (Oryza sativa L.). To investigate the distribution of the original CMS source and explore the origin of the gametophytic CMS gene, a total of 427 individuals from seventeen representative populations of O. rufipogon were collected from Dongxiang of Jiangxi Province to Sanya of Hainan Province, China, for PCR amplification of atp6, orfH79 and B-atp6-orfH79, respectively. RESULTS B-atp6-orfH79 and its variants (B-atp6-GSV) were detected by PCR amplification in five of the seventeen populations (i.e., HK, GZ, PS, TL and YJ) and could be divided into three haplotypes, i.e., BH1, BH2 and BH3. The BH2 haplotype was identical to B-atp6-orfH79, while BH1 and BH3 were novel haplotypes of B-atp6-GSV. Combined with the high-homology sequences in GenBank, a total of eighteen haplotypes were revealed, with only ten haplotypes in orfH79 and its variants (GSV), which belong to three species (i.e., O. rufipogon, Oryza nivara and Oryza sativa). These haplotypes clearly demonstrated the uniform structural characteristics of B-atp6-orfH79: in addition to the conserved sequence (671 bp), composed of B-atp6 (619 bp) and the downstream sequence that follows B-atp6 (DS, 52 bp), and the GSV sequence, a highly variable sequence (VS, 176 bp) lies between the DS and the GSV, with five insertions or deletions and more than 30 single nucleotide polymorphisms. Maximum likelihood analysis showed that the eighteen haplotypes formed three clades with high support. The hierarchical analysis of molecular variance (AMOVA) indicated variation among all populations (FST = 1; P < 0.001), which implied that the chimeric structure occurred independently. Three haplotypes (i.e., H1, H2 and H3) were detected with the orfH79 primer, and they were identical to the GSV in the B-atp6-GSV structure, respectively. All seventeen haplotypes of orfH79 belonged to six species based on our results and the existing references. Seven single nucleotide polymorphisms in the GSV section can be translated into eleven different amino acid sequences. CONCLUSIONS Overall, this study indicates that orfH79 was always accompanied by B-atp6; it not only provides two original CMS sources for rice breeding but also confirms the uniform structure of B-atp6-orfH79, which contributes to revealing the origin of rice gametophytic CMS genes and the reason for the frequent recombination of mitochondrial DNA.
Affiliation(s)
- Xuemei Zhang
  - State Key Laboratory of Conservation and Utilization of Bio-Resources in Yunnan, The Key Laboratory of Medicinal Plant Biology of Yunnan Province, Yunnan Agricultural University, Kunming, 650201, Yunnan, China
  - College of Agronomy and Biotechnology, Yunnan Agricultural University, Kunming, 650201, Yunnan, China
- Shuying Chen
  - College of Agronomy and Biotechnology, Yunnan Agricultural University, Kunming, 650201, Yunnan, China
- Zixian Zhao
  - College of Agronomy and Biotechnology, Yunnan Agricultural University, Kunming, 650201, Yunnan, China
- Cunqiang Ma
  - College of Horticulture, Nanjing Agricultural University, Nanjing, 210095, Jiangsu, China
- Yating Liu
  - College of Agronomy and Biotechnology, Yunnan Agricultural University, Kunming, 650201, Yunnan, China
  - College of Tobacco, Yunnan Agricultural University, Kunming, 650201, Yunnan, China
3
Wang Y, Zhu X, Yang L, Hu X, He K, Yu C, Jiao S, Chen J, Guo R, Yang S. IDDLncLoc: Subcellular Localization of LncRNAs Based on a Framework for Imbalanced Data Distributions. Interdiscip Sci 2022; 14:409-420. [PMID: 35192174 DOI: 10.1007/s12539-021-00497-6] [Received: 05/21/2021] [Revised: 12/16/2021] [Accepted: 12/20/2021] [Indexed: 06/14/2023]
Abstract
Long non-coding RNAs (lncRNAs) play a crucial role in many cellular processes, such as genetic markers, RNA splicing, signaling, and protein regulation. Because identifying the localization of a lncRNA in the cell through experimental methods is complicated, hard to reproduce, and expensive, we propose a novel method named IDDLncLoc, which adopts an ensemble model to predict subcellular localization. In the proposed model, dinucleotide-based auto-cross covariance features, k-mer nucleotide composition features, and composition, transition, and distribution features are introduced to encode a raw RNA sequence as a vector. To screen out reliable features, feature selection through binomial distribution and recursive feature elimination is employed. Furthermore, oversampling in mini-batch, random sampling, and stacking ensemble strategies are customized to overcome the problem of data imbalance on the benchmark dataset. Finally, compared with the latest methods, IDDLncLoc achieves an accuracy of 94.96% on the benchmark dataset, which is 2.59% higher than the best method, and the results further demonstrate that IDDLncLoc performs well on the subcellular localization of lncRNAs. In addition, a user-friendly web server is available at http://lncloc.club.
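To make the k-mer nucleotide composition features mentioned in this abstract concrete, the following is a minimal sketch of one common way to compute them. It is not the IDDLncLoc implementation: the function name `kmer_composition`, the ACGU alphabet, the T-to-U normalization, and the example sequence are all assumptions made for illustration.

```python
from itertools import product


def kmer_composition(seq: str, k: int = 2) -> dict[str, float]:
    """Return the relative frequency of every k-mer over the ACGU alphabet.

    Assumption: DNA-style input is converted to RNA (T -> U), and windows
    containing any other symbol (e.g. N) are skipped.
    """
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k over the sequence and count valid k-mers.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    # Normalize counts to frequencies; an empty sequence yields all zeros.
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


# Hypothetical toy sequence, used only to show the shape of the output.
features = kmer_composition("AUGGCUAAGU", k=2)
```

For k = 2 the result is a fixed-length vector of 16 frequencies regardless of sequence length, which is what lets sequences of different lengths be fed to the same downstream classifier.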
Affiliation(s)
- Yan Wang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
  - School of Artificial Intelligence, Jilin University, Changchun, China
- Xiaopeng Zhu
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Lili Yang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
  - Department of Obstetrics, The First Hospital of Jilin University, Changchun, China
- Xuemei Hu
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Kai He
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Cuinan Yu
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Shaoqing Jiao
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Jiali Chen
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Rui Guo
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Sen Yang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
4
Abstract
Ribosome profiling shows potential for studying the function of long noncoding RNAs (lncRNAs). We introduce a bioinformatics pipeline for detecting ribosome-associated lncRNAs (ribo-lncRNAs) from ribosome profiling data. Further, we describe a machine-learning approach for the characterization of ribo-lncRNAs based on their sequence features. Scripts for ribo-lncRNA analysis can be accessed at https://ribolnc.hamadalab.com/.
Affiliation(s)
- Chao Zeng
  - AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), Tokyo, Japan
  - Faculty of Science and Engineering, Waseda University, Tokyo, Japan
- Michiaki Hamada
  - AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), Tokyo, Japan
  - Faculty of Science and Engineering, Waseda University, Tokyo, Japan
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
  - Institute for Medical-oriented Structural Biology, Waseda University, Tokyo, Japan
  - Graduate School of Medicine, Nippon Medical School, Tokyo, Japan
5
Zhao Z, Zhang X, Chen F, Fang L, Li J. Accurate prediction of DNA N4-methylcytosine sites via boost-learning various types of sequence features. BMC Genomics 2020; 21:627. [PMID: 32917152 PMCID: PMC7488740 DOI: 10.1186/s12864-020-07033-8] [Received: 05/06/2020] [Accepted: 08/27/2020] [Indexed: 11/10/2022]
Abstract
BACKGROUND DNA N4-methylcytosine (4mC) is a critical epigenetic modification and has various roles in the restriction-modification system. Due to the high cost of experimental laboratory detection, computational methods using sequence characteristics and machine learning algorithms have been explored to identify 4mC sites from DNA sequences. However, state-of-the-art methods have limited performance because of the lack of effective sequence features and the ad hoc choice of learning algorithms to cope with this problem. This paper aims to propose a new sequence feature space and a machine learning algorithm with a feature selection scheme to address the problem. RESULTS The feature importance score distributions in datasets of six species are first reported and analyzed. Then the impact of feature selection on model performance is evaluated by independent testing on benchmark datasets, where the ACC and MCC measurements after feature selection increase by 2.3% to 9.7% and 0.05 to 0.19, respectively. The proposed method is compared with three state-of-the-art predictors using independent tests and 10-fold cross-validation, and our method outperforms them on all datasets, improving the ACC by 3.02% to 7.89% and the MCC by 0.06 to 0.15 in the independent test. Two detailed case studies with the proposed method have confirmed the excellent overall performance and correctly identified 24 of 26 4mC sites from the C. elegans gene and 126 of 137 4mC sites from the D. melanogaster gene. CONCLUSIONS The results show that the proposed feature space and learning algorithm with feature selection can improve the performance of DNA 4mC prediction on the benchmark datasets. The two case studies prove the effectiveness of our method in practical situations.
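The ACC and MCC measurements quoted in this abstract follow the standard definitions from a binary confusion matrix. The sketch below shows those standard formulas and applies a simple hit-rate calculation to the two case-study counts given in the abstract (24 of 26 and 126 of 137); the function names are assumptions, and the abstract does not report full confusion matrices, so only the fraction of sites recovered can be reproduced here.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of true sites that were recovered: TP / (TP + FN)."""
    return tp / (tp + fn)


# Counts taken from the abstract: sites correctly identified out of all sites.
c_elegans_recall = recall(tp=24, fn=26 - 24)        # 24 of 26, about 0.923
d_melanogaster_recall = recall(tp=126, fn=137 - 126)  # 126 of 137, about 0.920
```

MCC is reported alongside ACC because it stays informative when the positive and negative classes differ in size, whereas accuracy alone can look high for a predictor that favors the majority class.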
Affiliation(s)
- Zhixun Zhao
  - Advanced Analytics Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, PO Box 123, Broadway, Sydney, NSW 2007, Australia
- Xiaocai Zhang
  - Advanced Analytics Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, PO Box 123, Broadway, Sydney, NSW 2007, Australia
- Fang Chen
  - Data Science Institute, University of Technology Sydney, PO Box 123, Broadway, Sydney, NSW 2007, Australia
- Liang Fang
  - School of Computer, National University of Defense Technology, Changsha, 410073, China
- Jinyan Li
  - Advanced Analytics Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, PO Box 123, Broadway, Sydney, NSW 2007, Australia
6
Abstract
BACKGROUND With the increasing number of annotated long noncoding RNAs (lncRNAs) from the genome, researchers are continually updating their understanding of lncRNAs. Recently, thousands of lncRNAs have been reported to be associated with ribosomes in mammals. However, their biological functions or mechanisms are still unclear. RESULTS In this study, we investigated the sequence features involved in the ribosomal association of lncRNAs. We extracted ninety-nine sequence features corresponding to different biological mechanisms (i.e., RNA splicing, putative ORF, k-mer frequency, RNA modification, RNA secondary structure, and repeat element). An [Formula: see text]-regularized logistic regression model was applied to screen these features. Finally, we obtained fifteen and nine important features for the ribosomal association of human and mouse lncRNAs, respectively. CONCLUSION To our knowledge, this is the first study to characterize ribosome-associated lncRNAs and ribosome-free lncRNAs from the perspective of sequence features. The sequence features identified in this study may shed light on the biological mechanism of the ribosomal association and provide important clues for functional analysis of lncRNAs.
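The abstract describes screening features with a regularized logistic regression, but the name of the penalty is elided in this listing ("[Formula: see text]"). As an illustration only, the sketch below assumes an L1 penalty, a common choice for feature screening because it drives the weights of weak features to exactly zero. The proximal-gradient solver, the function names, the hyperparameters, and the toy data are all assumptions and are not taken from the study.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(z: float, t: float) -> float:
    """Proximal operator of t * |z|: shrink toward zero, clamp small values to 0."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def fit_l1_logistic(X, y, lam=0.3, lr=0.5, n_iter=500):
    """Proximal gradient descent for logistic loss + lam * ||w||_1 (no intercept)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        # Residuals (prediction minus label) under the current weights.
        residuals = [
            sigmoid(sum(wj * xj for wj, xj in zip(w, row))) - yi
            for row, yi in zip(X, y)
        ]
        for j in range(d):
            grad_j = sum(r * row[j] for r, row in zip(residuals, X)) / n
            # Gradient step on the smooth loss, then L1 shrinkage.
            w[j] = soft_threshold(w[j] - lr * grad_j, lr * lam)
    return w


# Made-up toy data: feature 0 determines the label, feature 1 is only weakly related.
X = [[1, 1], [1, 1], [1, 1], [1, -1], [-1, 1], [-1, -1], [-1, -1], [-1, -1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
weights = fit_l1_logistic(X, y)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
```

On this toy data only feature 0 keeps a nonzero weight, so `selected` plays the role of the screened feature set; an unpenalized fit would instead assign some weight to the weak feature as well.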
Affiliation(s)
- Chao Zeng
  - Faculty of Science and Engineering, Waseda University, 55N-06-10, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555, Japan
  - AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), 3-4-1, Okubo Shinjuku-ku, Tokyo, 169-8555, Japan
- Michiaki Hamada
  - Faculty of Science and Engineering, Waseda University, 55N-06-10, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555, Japan
  - AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), 3-4-1, Okubo Shinjuku-ku, Tokyo, 169-8555, Japan
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-41-6 Aomi, Koto-ku, Tokyo, 135-0064, Japan
  - Institute for Medical-oriented Structural Biology, Waseda University, 2-2, Wakamatsu-cho Shinjuku-ku, Tokyo, 162-8480, Japan
  - Graduate School of Medicine, Nippon Medical School, 1-1-5, Sendagi, Bunkyo-ku, Tokyo, 113-8602, Japan
7
Shi JY, Huang H, Zhang YN, Long YX, Yiu SM. Predicting binary, discrete and continued lncRNA-disease associations via a unified framework based on graph regression. BMC Med Genomics 2017; 10:65. [PMID: 29322937 PMCID: PMC5763297 DOI: 10.1186/s12920-017-0305-y] [Indexed: 11/15/2022]
Abstract
Background In human genomes, long non-coding RNAs (lncRNAs) have attracted more and more attention because their dysfunctions are involved in many diseases. However, the associations between lncRNAs and diseases (LDA) still remain unknown in most cases. While identifying disease-related lncRNAs in vivo is costly, computational approaches are promising, both to accelerate the identification of associations and to provide clues on the underlying mechanisms of various lncRNA-caused diseases. Previous computational approaches usually focus only on predicting new associations between lncRNAs having known associations with diseases and other lncRNA-associated diseases. They also work only on binary lncRNA-disease associations (whether the pair has an association or not), which cannot reflect and reveal other biological facts, such as the number of proteins involved in LDA or how strong the association is (i.e., the intensity of LDA). Results To address the abovementioned issues, we propose a graph regression-based unified framework (GRUF). In particular, our method can work on lncRNAs that have no previously known disease association and on diseases that have no known association with any lncRNAs. Also, instead of giving only a binary answer for the association, our method tries to uncover richer biological relationships between a lncRNA and a disease, which may provide better clues for researchers. We compared GRUF with three state-of-the-art approaches and demonstrated the superiority of GRUF, which achieves a 5% to 16% improvement in terms of the area under the receiver operating characteristic curve (AUC). GRUF also provides a confidence score for each predicted LDA, which reveals a significant correlation between the score and the number of RNA-binding proteins involved in LDAs. Lastly, three of the top-5 LDA candidates generated by GRUF in novel prediction are verified indirectly by medical literature and known biological facts.
Conclusions The proposed GRUF has two advantages over existing approaches. Firstly, it can be used to work on lncRNAs that have no known disease association and diseases that have no known association with any lncRNAs. Secondly, instead of providing a binary answer (with or without association), GRUF works for both discrete and continued LDA, which helps reveal the pathological implications between lncRNAs and diseases.
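The AUC used for the comparison in this abstract can be read as the probability that a randomly chosen true association receives a higher score than a randomly chosen non-association. The sketch below computes it directly from that pairwise definition; it is a generic illustration, not the GRUF evaluation code, and the function name and the scores and labels in the example are made up.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    # Count how often a positive example outranks a negative one.
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical confidence scores for five candidate pairs (1 = known association).
example_auc = roc_auc([0.9, 0.7, 0.4, 0.6, 0.2], [1, 0, 1, 1, 0])
```

The quadratic pairwise loop is chosen for clarity; for large candidate lists a rank-based computation gives the same value in O(n log n).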
Affiliation(s)
- Jian-Yu Shi
  - School of Life Sciences, Northwestern Polytechnical University, Xi'an, 710072, China
- Hua Huang
  - School of Software and Microelectronics, Northwestern Polytechnical University, Xi'an, 710072, China
- Yan-Ning Zhang
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
- Yu-Xi Long
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
- Siu-Ming Yiu
  - Department of Computer Science, the University of Hong Kong, Hong Kong, 999077, China
8
Bolleman JT, Mungall CJ, Strozzi F, Baran J, Dumontier M, Bonnal RJP, Buels R, Hoehndorf R, Fujisawa T, Katayama T, Cock PJA. FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation. J Biomed Semantics 2016; 7:39. [PMID: 27296299 PMCID: PMC4907002 DOI: 10.1186/s13326-016-0067-z] [Received: 02/05/2014] [Accepted: 03/17/2016] [Indexed: 11/18/2022]
Abstract
Background Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. For querying biological annotations with Semantic Web technologies, there was no standard that described this potentially complex location information as subject-predicate-object triples. Description We have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned “omics” areas. Using the same data format to represent sequence positions, independent of file formats, allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. Conclusions Our ontology allows users to uniformly describe, and potentially merge, sequence annotations from multiple sources. Data sources using FALDO can prospectively be retrieved using federated SPARQL queries against public SPARQL endpoints and/or local private triple stores.
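To show what describing a feature location as subject-predicate-object triples can look like, here is a minimal sketch that emits Turtle text for one region on the forward strand using FALDO terms (`faldo:Region`, `faldo:begin`, `faldo:end`, `faldo:ExactPosition`, `faldo:ForwardStrandPosition`, `faldo:position`, `faldo:reference`, `faldo:location`). The helper name `faldo_region_turtle`, the `#region`, `#begin` and `#end` URI suffixes, and the `example.org` URIs are hypothetical and chosen only for illustration.

```python
FALDO = "http://biohackathon.org/resource/faldo#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def faldo_region_turtle(feature_uri: str, reference_uri: str, begin: int, end: int) -> str:
    """Serialize one forward-strand region as Turtle triples.

    Assumption: positions are 1-based; begin > end is not rejected because a
    region on a circular sequence may cross the origin.
    """
    if begin < 1 or end < 1:
        raise ValueError("positions are expected to be 1-based")
    lines = [
        f"@prefix faldo: <{FALDO}> .",
        f"@prefix rdf: <{RDF}> .",
        "",
        # The feature points at a region, which has a begin and an end position.
        f"<{feature_uri}> faldo:location <{feature_uri}#region> .",
        f"<{feature_uri}#region> rdf:type faldo:Region ;",
        f"    faldo:begin <{feature_uri}#begin> ;",
        f"    faldo:end <{feature_uri}#end> .",
    ]
    for name, pos in (("begin", begin), ("end", end)):
        # Each position carries its coordinate and the sequence it refers to.
        lines += [
            f"<{feature_uri}#{name}> rdf:type faldo:ExactPosition, faldo:ForwardStrandPosition ;",
            f"    faldo:position {pos} ;",
            f"    faldo:reference <{reference_uri}> .",
        ]
    return "\n".join(lines)


# Hypothetical feature and reference sequence URIs.
ttl = faldo_region_turtle("http://example.org/gene1", "http://example.org/chr1", 100, 200)
```

Because every position names its own reference sequence, the same pattern works whether the coordinates come from a genome assembly or a protein record, which is the file-format independence the abstract refers to.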
Affiliation(s)
- Jerven T Bolleman
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel, Servet, Geneva 4, 1211, Switzerland
- Joachim Baran
  - CODAMONO, 5-121 Marion Street, Toronto, M6R 1E6, Ontario, Canada
- Michel Dumontier
  - Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Room X223, Stanford, 94305-5479, CA, US
- Raoul J P Bonnal
  - Integrative Biology Program, Istituto Nazionale Genetica Molecolare, Milan, Italy
- Robert Buels
  - University of California, Berkeley, Berkeley, CA, USA
- Takatomo Fujisawa
  - Center for Information Biology, National Institute of Genetics, Research Organization of Information and Systems, 1111 Yata, Mishima, Shizuoka, 411-08540, Japan
- Toshiaki Katayama
  - Database Center for Life Science, Research Organization of Information and Systems, 2-11-16, Yayoi, Bunkyo-ku, Tokyo, 113-0032, Japan