1
Xu R, Li D, Yang W, Wang G, Li Y. Improving ncRNA family prediction using multi-modal contrastive learning of sequence and structure. Bioinformatics 2024; 40:btae640. [PMID: 39460948 PMCID: PMC11639665 DOI: 10.1093/bioinformatics/btae640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2024] [Revised: 10/15/2024] [Accepted: 10/22/2024] [Indexed: 10/28/2024]
Abstract
MOTIVATION Recent advancements in high-throughput sequencing technology have significantly increased the focus on non-coding RNA (ncRNA) research within the life sciences. Despite this, the functions of many ncRNAs remain poorly understood. Research suggests that ncRNAs within the same family typically share similar functions, underlining the importance of understanding their roles. There are two primary methods for predicting ncRNA families: biological and computational. Traditional biological methods are not suitable for large-scale data prediction due to the significant human and resource requirements. Concurrently, most existing computational methods either rely solely on ncRNA sequence data or are exclusively based on the secondary structure of ncRNA molecules. These methods fail to fully utilize the rich multimodal information available from ncRNAs, thereby preventing them from learning more comprehensive and in-depth feature representations. RESULTS To tackle these problems, we proposed MM-ncRNAFP, a multi-modal contrastive learning framework for ncRNA family prediction. We first used a pre-trained language model to encode the primary sequences of a large mammalian ncRNA dataset. Then, we adopted a contrastive learning framework with an attention mechanism to fuse the secondary structure information obtained by graph neural networks. The MM-ncRNAFP method can effectively fuse multi-modal information. Experimental comparisons with several competitive baselines demonstrated that MM-ncRNAFP can achieve more comprehensive representations of ncRNA features by integrating both sequence and structural information. This integration significantly enhances the performance of ncRNA family prediction. Ablation experiments and qualitative analyses were performed to verify the effectiveness of each component in our model. 
Moreover, since our model is pre-trained on a large amount of ncRNA data, it has the potential to bring significant improvements to other ncRNA-related tasks. AVAILABILITY AND IMPLEMENTATION MM-ncRNAFP and the datasets are available at https://github.com/xuruiting2/MM-ncRNAFP.
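The abstract above does not state which contrastive objective MM-ncRNAFP uses. As a minimal sketch, assuming a symmetric InfoNCE-style loss over paired sequence and structure embeddings (the function names, the temperature value, and the toy vectors are all hypothetical and not taken from the paper or its repository), the idea of pulling matched modalities together could look like this:

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(seq_emb, struct_emb, temperature=0.1):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    seq_emb[i] and struct_emb[i] are assumed to describe the same ncRNA;
    every other pairing in the batch is treated as a negative.
    """
    seq = [l2_normalize(v) for v in seq_emb]
    struct = [l2_normalize(v) for v in struct_emb]
    n = len(seq)
    # sim[i][j] = cosine similarity of sequence i and structure j, scaled.
    sim = [
        [sum(a * b for a, b in zip(seq[i], struct[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(logits):
        # Cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(logits):
            top = max(row)
            log_z = top + math.log(sum(math.exp(x - top) for x in row))
            total += log_z - row[i]
        return total / len(logits)

    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (diagonal_cross_entropy(sim) + diagonal_cross_entropy(transposed))


# Toy batch: two ncRNAs with 2-dimensional embeddings per modality.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In a real pipeline the embeddings would come from the pre-trained language model and the graph neural network mentioned in the abstract; here they are hand-written vectors so the loss can be checked by eye.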
Affiliation(s)
- Ruiting Xu
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Dan Li
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Wen Yang
  - International Medical Center, Shenzhen University General Hospital, SZU 518055, China
- Guohua Wang
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Yang Li
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
2
Zhang Y, Wang J, Yu J. PSA: an effective method for predicting horizontal gene transfers through parsimonious phylogenetic networks. Cladistics 2024; 40:443-455. [PMID: 38717786 DOI: 10.1111/cla.12578] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Revised: 03/08/2024] [Accepted: 03/20/2024] [Indexed: 07/15/2024]
Abstract
Horizontal gene transfer (HGT) from one organism to another can, according to some researchers, be abundant in the evolution of species. A phylogenetic network is a network structure that describes the HGTs among species. Several studies have proposed methods that construct phylogenetic networks to predict HGTs based on parsimony values. Existing definitions of the parsimony value of a phylogenetic network assume that each gene site or segment evolves independently along different trees in the network. In the current study, we instead define a novel parsimony value, denoted the p definition, for phylogenetic networks, considering that a gene as a whole typically evolves along a tree. Using simulated annealing, we propose a new method, the Phylogeny with Simulated Annealing (PSA) algorithm, to search for an optimal network based on the p definition. PSA is tested on simulated data. The results reveal that the parsimonious networks constructed using PSA better represent the evolutionary relationships of species involving HGTs. Additionally, the HGTs predicted using PSA are more accurate than those predicted using other methods. The PSA algorithm is publicly accessible at http://github.com/imustu/sap.
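The abstract names simulated annealing as the search strategy but gives no details of the move set or cooling schedule. The following is a generic sketch of simulated annealing only, not the PSA implementation: the `score` and `neighbor` callables, the geometric cooling schedule, and the toy integer problem are all assumptions made for illustration. In PSA the state would be a candidate phylogenetic network and the score its parsimony value under the p definition.

```python
import math
import random


def simulated_annealing(start, score, neighbor, t0=1.0, cooling=0.95,
                        steps=500, rng=None):
    """Minimize `score` by simulated annealing and return (best, best_score).

    `neighbor(state, rng)` proposes a random modification of `state`.
    Worse candidates are accepted with probability exp(-delta / t), which
    lets the search escape local optima while the temperature is high.
    """
    rng = rng or random.Random(0)
    current = best = start
    current_score = best_score = score(start)
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
        # Geometric cooling, floored to avoid division by zero.
        temperature = max(temperature * cooling, 1e-9)
    return best, best_score


# Toy stand-in for a network search: walk along the integers.
def toy_score(x):
    return (x - 3) ** 2


def toy_neighbor(x, rng):
    return x + rng.choice([-1, 1])


best, best_score = simulated_annealing(10, toy_score, toy_neighbor)
print(best, best_score)
```

Tracking the best state separately from the current one means the returned score can never be worse than that of the starting state, whatever random moves are accepted along the way.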
Affiliation(s)
- Yuan Zhang
  - School of Computer Science, Inner Mongolia University, Hohhot, 010021, China
- Juan Wang
  - School of Computer Science, Inner Mongolia University, Hohhot, 010021, China
- Jing Yu
  - College of Education, Inner Mongolia Normal University, Hohhot, 010022, China
3
Hong Y, Wang J. Frin: An Efficient Method for Representing Genome Evolutionary History. Front Genet 2019; 10:1261. [PMID: 31867045 PMCID: PMC6909884 DOI: 10.3389/fgene.2019.01261] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/18/2019] [Accepted: 11/14/2019] [Indexed: 11/13/2022]
Abstract
Phylogenetic analysis is important in understanding the process of biological evolution, and phylogenetic trees are used to represent evolutionary history. Each taxon in a phylogenetic tree has at most one parent, so phylogenetic trees cannot express the complex evolutionary information implicit in phylogeny. Phylogenetic networks can be used to express genome evolutionary histories. Therefore, research on the construction of phylogenetic networks is of great significance. The Cass algorithm is an efficient method for constructing phylogenetic networks because it can construct a much simpler network. However, Cass relies heavily on the order of the input data, i.e. different networks can be constructed for the same dataset with different input orders. Based on the frequency and incompatibility degree of taxa, we propose an efficient improved version of Cass, called Frin. The experimental results show that the networks constructed by Frin are not only simpler than those constructed by other methods, but also more consistent when the same data are supplied in different input orders. Furthermore, the phylogenetic network constructed by Frin is closer to the original information described by the phylogenetic trees. Frin has been built as a Java software package and is freely available at https://github.com/wangjuanimu/Frin.
Affiliation(s)
- Yan Hong
  - School of Computer Science, Inner Mongolia University, Hohhot, China
- Juan Wang
  - School of Computer Science, Inner Mongolia University, Hohhot, China
4
Jamil HM. Optimizing Phylogenetic Queries for Performance. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1692-1705. [PMID: 28858810 DOI: 10.1109/tcbb.2017.2743706] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
The vast majority of phylogenetic databases do not support declarative querying through which their contents can be flexibly and conveniently accessed, and the template-based query interfaces they do support do not allow arbitrary speculative queries. Consequently, they also do not support query optimization that leverages unique phylogeny properties. While a small number of graph query languages such as XQuery, Cypher, and GraphQL exist for computer-savvy users, most are too general and complex to be useful for biologists, and too inefficient for large phylogeny querying. In this paper, we discuss a recently introduced visual query language, called PhyQL, that leverages phylogeny-specific properties to support essential and powerful constructs for a large class of phylogenetic queries. We develop a range of pruning aids, and propose a substantial set of query optimization strategies using these aids that are suitable for large phylogeny querying. A hybrid optimization technique that exploits a set of indices and "graphlet" partitioning is discussed. A "fail soonest" strategy is used to avoid hopeless processing and is shown to produce dividends. Possible novel optimization techniques yet to be explored are also discussed.
5
Wang J, Guo M. A review of metrics measuring dissimilarity for rooted phylogenetic networks. Brief Bioinform 2018; 20:1972-1980. [DOI: 10.1093/bib/bby062] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 05/22/2018] [Revised: 06/20/2018] [Indexed: 11/14/2022]
Abstract
A rooted phylogenetic network is an important structure in the description of evolutionary relationships. Computing the distance (topological dissimilarity) between two rooted phylogenetic networks is a fundamental task in phylogenetic analysis. During the past few decades, several polynomial-time computable metrics have been described. Here, we give a comprehensive review and analysis of those metrics, including the correlation among metrics and the distribution of distance values computed by each metric. Moreover, we describe the software and website, CDRPN (Computing Distance for Rooted Phylogenetic Networks), for measuring the topological dissimilarity between rooted phylogenetic networks.
Availability
http://bioinformatics.imu.edu.cn/distance/
Contact
guomaozu@bucea.edu.cn
Affiliation(s)
- Juan Wang
  - School of Computer Science, Inner Mongolia University, Hohhot, Inner Mongolia 010021, P.R. China
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, P.R. China
  - Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing 100044, P.R. China
6
Abstract
BACKGROUND Building evolutionary trees for massive sets of unaligned DNA sequences is both crucial and challenging. Reconstructing an evolutionary tree for ultra-large sequence sets is hard, and massive multiple sequence alignment is likewise demanding in both time and space. The recently developed Hadoop and Spark platforms offer new opportunities for classical computational biology problems. In this paper, we try to solve multiple sequence alignment and evolutionary reconstruction in parallel. RESULTS HPTree, which is developed in this paper, can process large DNA sequence files quickly. It works well on files larger than 1 GB and achieves better performance than other evolutionary reconstruction tools. Users can use HPTree to reconstruct evolutionary trees on computer clusters or cloud platforms (e.g. Amazon Cloud). HPTree could help with population evolution research and metagenomics analysis. CONCLUSIONS In this paper, we employ the Hadoop and Spark platforms and design an evolutionary tree reconstruction software tool for massive unaligned DNA sequences. Clustering and multiple sequence alignment are done in parallel. The neighbour-joining model is employed for building the evolutionary tree. The software and its source code are available at http://lab.malab.cn/soft/HPtree/.
Affiliation(s)
- Quan Zou
  - School of Computer Science and Technology, Tianjin University, Tianjin, People's Republic of China
  - Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen, China
  - State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
- Shixiang Wan
  - School of Computer Science and Technology, Tianjin University, Tianjin, People's Republic of China
- Xiangxiang Zeng
  - Department of Computer Science, Xiamen University, Xiamen, China
- Zhanshan Sam Ma
  - State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
7
Chen X, Wang C, Tang S, Yu C, Zou Q. CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment. BMC Bioinformatics 2017. [PMID: 28646874 PMCID: PMC5483318 DOI: 10.1186/s12859-017-1725-6] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Indexed: 11/30/2022]
Abstract
Background Multiple sequence alignment (MSA) is a classic and powerful technique for sequence analysis in bioinformatics. With the rapid growth of biological datasets, MSA parallelization becomes necessary to keep its running time at an acceptable level. Although there is a lot of work on MSA problems, existing approaches are either insufficient or contain implicit assumptions that limit their generality. First, the properties of users' sequences, including the sizes of datasets and the lengths of sequences, can take arbitrary values and are generally unknown before submission, which previous work unfortunately ignores. Second, the center star strategy is suited to aligning similar sequences, but its first stage, center sequence selection, is highly time-consuming and requires further optimization. Moreover, given a heterogeneous CPU/GPU platform, prior studies consider MSA parallelization on GPU devices only, leaving the CPUs idle during the computation. Co-run computation, however, can maximize the utilization of the computing resources by running the workload on both CPU and GPU simultaneously. Results This paper presents CMSA, a robust and efficient MSA system for large-scale datasets on heterogeneous CPU/GPU platforms. It performs and optimizes multiple sequence alignment automatically for users' submitted sequences without any assumptions. CMSA adopts the co-run computation model so that both CPU and GPU devices are fully utilized. Moreover, CMSA proposes an improved center star strategy that reduces the time complexity of its center sequence selection process from O(mn^2) to O(mn). The experimental results show that CMSA achieves up to an 11× speedup and outperforms state-of-the-art software. Conclusion CMSA focuses on multiple similar RNA/DNA sequence alignment and proposes a novel bitmap-based algorithm to improve the center star strategy. We conclude that harvesting the high performance of modern GPUs is a promising approach to accelerating multiple sequence alignment. In addition, adopting the co-run computation model can significantly increase overall system utilization. The source code is available at https://github.com/wangvsa/CMSA.
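To make the center sequence selection stage concrete, here is a brute-force sketch of the classic center star idea: pick the sequence whose total edit distance to all others is smallest. This is the slow baseline that motivates the optimization, not CMSA's bitmap-based algorithm, which the abstract does not describe in detail. The function names and the example sequences are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using a rolling DP row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def select_center(sequences):
    """Return the index of the sequence closest, in total, to all others."""
    totals = [
        sum(edit_distance(seq, other)
            for k, other in enumerate(sequences) if k != i)
        for i, seq in enumerate(sequences)
    ]
    return min(range(len(sequences)), key=totals.__getitem__)


reads = ["ACGT", "ACGA", "ACCT", "TCGT"]
print(select_center(reads))  # -> 0, since "ACGT" is one edit from each other read
```

Every pair of sequences is compared with a full dynamic-programming pass, so the cost grows quickly with the number and length of sequences, which is why the selection stage is a natural target for optimization.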
Affiliation(s)
- Xi Chen
  - School of Computer Science and Technology, Tianjin University, Yaguan Road, Tianjin, China
- Chen Wang
  - School of Computer Science and Technology, Tianjin University, Yaguan Road, Tianjin, China
- Shanjiang Tang
  - School of Computer Science and Technology, Tianjin University, Yaguan Road, Tianjin, China
- Ce Yu
  - School of Computer Science and Technology, Tianjin University, Yaguan Road, Tianjin, China
- Quan Zou
  - School of Computer Science and Technology, Tianjin University, Yaguan Road, Tianjin, China
8
Wang J, Guo M. A Metric on the Space of kth-order reduced Phylogenetic Networks. Sci Rep 2017; 7:3189. [PMID: 28600511 DOI: 10.1038/s41598-017-03363-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2016] [Accepted: 04/27/2017] [Indexed: 11/09/2022]
Abstract
Phylogenetic networks can be used to describe the evolutionary history of species which experience a certain number of reticulate events, and represent conflicts in phylogenetic trees that may be due to inadequacies of the evolutionary model used in the construction of the trees. Measuring the dissimilarity between two phylogenetic networks is at the heart of our understanding of the evolutionary history of species. This paper proposes a new metric, i.e. kth-distance, for the space of kth-order reduced phylogenetic networks that can be calculated in polynomial time in the size of the compared networks.
Affiliation(s)
- Juan Wang
  - School of Computer Science, Inner Mongolia University, Hohhot, 010021, P.R. China
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, 100044, P.R. China
9
Wang J. A Survey of Methods for Constructing Rooted Phylogenetic Networks. PLoS One 2016; 11:e0165834. [PMID: 27806124 PMCID: PMC5091748 DOI: 10.1371/journal.pone.0165834] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/06/2016] [Accepted: 10/18/2016] [Indexed: 11/18/2022]
Abstract
Rooted phylogenetic networks are primarily used to represent conflicting evolutionary information and to describe reticulate evolutionary events in phylogeny. Many methods have been presented for constructing rooted phylogenetic networks, of which those based on the decomposition property of networks and on the incompatible graph (such as CASS, LNETWORK and BIMLR) are more efficient than other available methods. This paper discusses and compares these methods on both practical and artificial datasets, in terms of running time and the effectiveness of the constructed phylogenetic networks. The results show that LNETWORK can construct much simpler networks than the others.
Affiliation(s)
- Juan Wang
  - School of Computer Science, Inner Mongolia University, Hohhot, China
10
Constructing Phylogenetic Networks Based on the Isomorphism of Datasets. BIOMED RESEARCH INTERNATIONAL 2016; 2016:4236858. [PMID: 27547759 PMCID: PMC4980496 DOI: 10.1155/2016/4236858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2016] [Accepted: 06/30/2016] [Indexed: 11/18/2022]
Abstract
Constructing rooted phylogenetic networks from rooted phylogenetic trees has become an important problem in molecular evolution. Many methods have been presented in this area, and the most efficient of them are based on the incompatible graph, such as CASS, LNETWORK, and BIMLR. This paper examines what the methods based on the incompatible graph have in common, the relationship between the incompatible graph and the phylogenetic network, and the topologies of incompatible graphs. We can find all the simplest datasets for a topology G and construct a network for every such dataset. For any given dataset, we can then compute a network from the network representing the simplest dataset that is isomorphic to it. This process saves time for the algorithms when constructing networks.
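The abstract does not define the incompatible graph. As a sketch under the definition commonly used in the phylogenetic network literature (assumed here, not confirmed by this abstract), two clusters of taxa are incompatible when they overlap but neither contains the other, and the graph has one node per cluster with an edge between each incompatible pair. The function names and example clusters are hypothetical.

```python
from itertools import combinations


def incompatible(cluster_a, cluster_b):
    """True if the clusters overlap and neither is a subset of the other."""
    overlaps = bool(cluster_a & cluster_b)
    nested = cluster_a <= cluster_b or cluster_b <= cluster_a
    return overlaps and not nested


def incompatibility_graph(clusters):
    """Return the edge set {(i, j)} over cluster indices with i < j."""
    frozen = [frozenset(c) for c in clusters]
    return {
        (i, j)
        for i, j in combinations(range(len(frozen)), 2)
        if incompatible(frozen[i], frozen[j])
    }


clusters = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"d"}]
print(incompatibility_graph(clusters))  # -> {(0, 1)}
```

Only the first two clusters conflict: they share taxon b without being nested, whereas the third cluster contains both and the fourth is disjoint from all of them. A set of clusters whose graph has no edges can be displayed on a single tree.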
11
Wang J. A Metric on the Space of Partly Reduced Phylogenetic Networks. BIOMED RESEARCH INTERNATIONAL 2016; 2016:7534258. [PMID: 27419137 PMCID: PMC4935902 DOI: 10.1155/2016/7534258] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/30/2016] [Accepted: 05/23/2016] [Indexed: 11/17/2022]
Abstract
Phylogenetic networks are a generalization of phylogenetic trees that allow for the representation of evolutionary events acting at the population level, such as recombination between genes, hybridization between lineages, and horizontal gene transfer. Researchers have designed several measures for computing the dissimilarity between two phylogenetic networks, and each measure has been proven to be a metric on a special class of phylogenetic networks. However, none of the existing measures is a metric on the space of partly reduced phylogenetic networks. In this paper, we provide a metric, the d_e-distance, on the space of partly reduced phylogenetic networks, which is polynomial-time computable.
Affiliation(s)
- Juan Wang
  - School of Computer Science, Inner Mongolia University, Hohhot 010021, China
12
Zou Q, Hu Q, Guo M, Wang G. HAlign: Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics 2015; 31:2475-81. [DOI: 10.1093/bioinformatics/btv177] [Citation(s) in RCA: 121] [Impact Index Per Article: 12.1] [Received: 02/03/2015] [Accepted: 03/23/2015] [Indexed: 12/26/2022]