1
|
Deep Canonical Correlation Fusion Algorithm Based on Denoising Autoencoder for ASD Diagnosis and Pathogenic Brain Region Identification. Interdiscip Sci 2024:10.1007/s12539-024-00625-y. [PMID: 38573456 DOI: 10.1007/s12539-024-00625-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Revised: 02/22/2024] [Accepted: 02/25/2024] [Indexed: 04/05/2024]
Abstract
Autism Spectrum Disorder (ASD) is a neurodevelopmental condition characterized by atypical neural activity. Early intervention is key to managing the progression of ASD, and current research primarily focuses on the use of structural magnetic resonance imaging (sMRI) or resting-state functional magnetic resonance imaging (rs-fMRI) for diagnosis. Moreover, the use of autoencoders for disease classification has not been sufficiently explored. In this study, we introduce a new autoencoder-based framework, the Deep Canonical Correlation Fusion algorithm based on Denoising Autoencoder (DCCF-DAE), which proves to be effective in handling high-dimensional data. The framework first extracts features from each type of data with an advanced autoencoder and then fuses these features through the DCCF model. The fused features are then used for disease classification. DCCF integrates functional and structural data to help accurately diagnose ASD and identify critical Regions of Interest (ROIs) in disease mechanisms. We compare the proposed framework with other methods on the Autism Brain Imaging Data Exchange (ABIDE) database, and the results demonstrate its outstanding performance in ASD diagnosis. The superiority of DCCF-DAE highlights its potential as a crucial tool for early ASD diagnosis and monitoring.
|
2
|
OIF-Net: An Optical Flow Registration-Based PET/MR Cross-Modal Interactive Fusion Network for Low-Count Brain PET Image Denoising. IEEE TRANSACTIONS ON MEDICAL IMAGING 2024; 43:1554-1567. [PMID: 38096101 DOI: 10.1109/tmi.2023.3342809] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
The short frames of low-count positron emission tomography (PET) images generally cause high levels of statistical noise. Thus, improving the quality of low-count images by using image postprocessing algorithms to achieve better clinical diagnoses has attracted widespread attention in the medical imaging community. Most existing deep learning-based low-count PET image enhancement methods have achieved satisfying results; however, few of them focus on denoising low-count PET images with the magnetic resonance (MR) image modality as guidance. The prior context features contained in MR images can provide abundant and complementary information for single low-count PET image denoising, especially in ultralow-count (2.5%) cases. To this end, we propose a novel two-stream dual PET/MR cross-modal interactive fusion network with an optical flow pre-alignment module, namely, OIF-Net. Specifically, the learnable optical flow registration module enables the spatial manipulation of MR imaging inputs within the network without any extra training supervision. Registered MR images fundamentally solve the problem of feature misalignment in the multimodal fusion stage, which greatly benefits the subsequent denoising process. In addition, we design a spatial-channel feature enhancement module (SC-FEM) that considers the interactive impacts of multiple modalities and provides additional information flexibility in both the spatial and channel dimensions. Furthermore, instead of simply concatenating the features extracted from the two modalities as an intermediate fusion method, the proposed cross-modal feature fusion module (CM-FFM) adopts cross-attention at multiple feature levels and greatly improves the two modalities' feature fusion procedure. Extensive experimental assessments conducted on real clinical datasets, as well as an independent clinical testing dataset, demonstrate that the proposed OIF-Net outperforms the state-of-the-art methods.
|
3
|
Deep integrated fusion of local and global features for cervical cell classification. Comput Biol Med 2024; 171:108153. [PMID: 38364660 DOI: 10.1016/j.compbiomed.2024.108153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Revised: 02/08/2024] [Accepted: 02/12/2024] [Indexed: 02/18/2024]
Abstract
Cervical cytology image classification is of great significance to cervical cancer diagnosis and prognosis. Recently, convolutional neural networks (CNNs) and visual transformers have been adopted as two branches to learn features for image classification by simply adding local and global features. However, such simple addition may not integrate these features effectively. In this study, we explore the synergy of local and global features of cytology images for classification tasks. Specifically, we design a Deep Integrated Feature Fusion (DIFF) block to synergize local and global features of cytology images from a CNN branch and a transformer branch. Our proposed method is evaluated on three cervical cell image datasets (SIPaKMeD, CRIC, Herlev) and another large blood cell dataset, BCCD, for several multi-class and binary classification tasks. Experimental results demonstrate the effectiveness of the proposed method in cervical cell classification, which could assist medical specialists in better diagnosing cervical cancer.
|
4
|
pathMap: a path-based mapping tool for long noisy reads with high sensitivity. Brief Bioinform 2024; 25:bbae107. [PMID: 38517696 PMCID: PMC10959152 DOI: 10.1093/bib/bbae107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2023] [Revised: 12/25/2023] [Accepted: 02/28/2024] [Indexed: 03/24/2024] Open
Abstract
With the rapid development of single-molecule sequencing (SMS) technologies, the output read length is continuously increasing. Mapping such reads onto a reference genome is one of the most fundamental tasks in sequence analysis. Mapping sensitivity is becoming a major concern, since a more sensitive mapper detects more aligned regions on the reference and obtains more aligned bases, which are useful for downstream analysis. In this study, we present pathMap, a novel k-mer graph-based mapper that is specifically designed for mapping SMS reads with high sensitivity. By viewing the alignment chain as a path containing as many anchors as possible in the matched k-mer graph, pathMap treats chaining as a path selection problem in the directed graph. Because pathMap iteratively searches for the longest path among the remaining nodes, more high-quality candidate chains can be effectively detected and aligned. Compared to other state-of-the-art mapping methods such as minimap2 and Winnowmap2, experimental results on simulated and real-life datasets demonstrate that pathMap obtains at least 11.50% more mapped chains than its closest competitor and increases the mapping sensitivity by 17.28% and 13.84% of bases over the next-best mapper for Pacific Biosciences and Oxford Nanopore sequencing data, respectively. In addition, pathMap is more robust to sequence errors and more sensitive to species- and strain-specific identification of pathogens using MinION reads.
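The idea of treating chaining as repeated longest-path selection can be illustrated with a minimal sketch. This is not pathMap's implementation: the anchor representation, the strict collinearity rule for edges, the quadratic dynamic program, and the names `longest_chain`, `iterative_chains`, and `min_anchors` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical anchor: (read position, reference position) of a matched k-mer.
Anchor = Tuple[int, int]


def longest_chain(anchors: List[Anchor]) -> List[Anchor]:
    """Return one longest path of collinear anchors using an O(n^2) DP."""
    if not anchors:
        return []
    nodes = sorted(anchors)
    best = [1] * len(nodes)    # best[i]: number of anchors on the longest path ending at i
    prev = [-1] * len(nodes)   # back-pointers used to recover the path
    for i, (ri, gi) in enumerate(nodes):
        for j in range(i):
            rj, gj = nodes[j]
            # Assumed edge rule: j -> i only if both coordinates strictly increase.
            if rj < ri and gj < gi and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    end = max(range(len(nodes)), key=best.__getitem__)
    path = []
    while end != -1:
        path.append(nodes[end])
        end = prev[end]
    return path[::-1]


def iterative_chains(anchors: List[Anchor], min_anchors: int = 2) -> List[List[Anchor]]:
    """Repeatedly extract the longest path from the nodes that remain."""
    remaining = list(dict.fromkeys(anchors))  # drop duplicates, keep order
    chains = []
    while remaining:
        chain = longest_chain(remaining)
        if len(chain) < min_anchors:
            break
        chains.append(chain)
        used = set(chain)
        remaining = [a for a in remaining if a not in used]
    return chains
```

Calling `iterative_chains` on a mix of anchors from two separate aligned regions returns one chain per region, longest first, which is the behavior the abstract attributes to searching the remaining nodes after each selection.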
|
5
|
invMap: a sensitive mapping tool for long noisy reads with inversion structural variants. BIOINFORMATICS (OXFORD, ENGLAND) 2023; 39:btad726. [PMID: 38058196 DOI: 10.1093/bioinformatics/btad726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Revised: 11/02/2023] [Accepted: 12/05/2023] [Indexed: 12/08/2023]
Abstract
MOTIVATION Longer reads produced by PacBio or Oxford Nanopore sequencers span the breakpoints of structural variations (SVs) more frequently than shorter reads. As a result, existing long-read mapping methods often generate wrong alignments and variant calls. Compared to deletions and insertions, inversion events are more difficult to detect, since the anchors in inversion regions are nonlinear to those in SV-free regions. To address this issue, this study presents a novel long-read mapping algorithm named invMap. RESULTS For each long noisy read, invMap first locates the aligned region with a specifically designed scoring method for chaining, then checks the remaining anchors in the aligned region to discover potential inversions. We benchmark invMap on simulated datasets across different genomes and sequencing coverages; experimental results demonstrate that invMap locates aligned regions and calls inversion SVs more accurately than the competing methods. The real human genome sequencing dataset of NA12878 illustrates that invMap can effectively find more candidate variant calls for inversions than the competing methods. AVAILABILITY AND IMPLEMENTATION The invMap software is available at https://github.com/zhang134/invMap.git.
|
6
|
Sparse2Noise: Low-dose synchrotron X-ray tomography without high-quality reference data. Comput Biol Med 2023; 165:107473. [PMID: 37690288 DOI: 10.1016/j.compbiomed.2023.107473] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2023] [Revised: 08/30/2023] [Accepted: 09/04/2023] [Indexed: 09/12/2023]
Abstract
BACKGROUND Synchrotron radiation computed tomography (SR-CT) holds promise for high-resolution in vivo imaging. Notably, the reconstruction of SR-CT images requires a large set of data to be captured with sufficient photons from multiple angles, resulting in a high radiation dose received by the object. Reducing the number of projections and/or the photon flux is a straightforward means to lessen the radiation dose, but it compromises data completeness and thus introduces noise and artifacts. Deep learning (DL)-based supervised methods effectively denoise and remove artifacts, but they heavily depend on high-quality paired data acquired at high doses. Although algorithms exist for training without high-quality references, they struggle to effectively eliminate persistent artifacts present in real-world data. METHODS This work presents a novel low-dose imaging strategy, Sparse2Noise, which combines the reconstruction data from a paired sparse-view CT scan (normal-flux) and full-view CT scan (low-flux) using a convolutional neural network (CNN). Sparse2Noise does not require high-quality reconstructed data as references and allows for fresh training on very small datasets. Sparse2Noise was evaluated on both simulated and experimental data. RESULTS Sparse2Noise effectively reduces noise and ring artifacts while maintaining high image quality, outperforming state-of-the-art image denoising methods at the same dose levels. Furthermore, Sparse2Noise produces impressively high image quality for ex vivo rat hindlimb imaging at an acceptably low radiation dose (i.e., 0.5 Gy with an isotropic voxel size of 26 μm). CONCLUSIONS This work represents a significant advance towards in vivo SR-CT imaging. It is noteworthy that Sparse2Noise can also be used for denoising in conventional CT and/or phase-contrast CT.
|
7
|
A multi-modal deep neural network for multi-class liver cancer diagnosis. Neural Netw 2023; 165:553-561. [PMID: 37354807 DOI: 10.1016/j.neunet.2023.06.013] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 08/05/2022] [Revised: 01/21/2023] [Accepted: 06/07/2023] [Indexed: 06/26/2023]
Abstract
Liver disease is a potentially asymptomatic clinical entity that may progress to patient death. This study proposes a multi-modal deep neural network for multi-class malignant liver diagnosis. In parallel with the portal venous computed tomography (CT) scans, pathology data is utilized to prognosticate primary liver cancer variants and metastasis. The processed CT scans are fed to the deep dilated convolution neural network to explore salient features. The residual connections are further added to address vanishing gradient problems. Correspondingly, five pathological features are learned using a wide and deep network that gives a benefit of memorization with generalization. The down-scaled hierarchical features from CT scan and pathology data are concatenated to pass through fully connected layers for classification between liver cancer variants. In addition, the transfer learning of pre-trained deep dilated convolution layers assists in handling insufficient and imbalanced dataset issues. The fine-tuned network can predict three-class liver cancer variants with an average accuracy of 96.06% and an Area Under Curve (AUC) of 0.832. To the best of our knowledge, this is the first study to classify liver cancer variants by integrating pathology and image data, hence following the medical perspective of malignant liver diagnosis. The comparative analysis on the benchmark dataset shows that the proposed multi-modal neural network outperformed most of the liver diagnostic studies and is comparable to others.
|
8
|
A posterior probability based Bayesian method for single-cell RNA-seq data imputation. Methods 2023; 216:21-38. [PMID: 37315825 DOI: 10.1016/j.ymeth.2023.06.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/28/2023] [Revised: 05/19/2023] [Accepted: 06/07/2023] [Indexed: 06/16/2023] Open
Abstract
Single-cell RNA-sequencing (scRNA-seq) data contain a large number of zeros. Such dropout events impede downstream data analyses. We propose BayesImpute to infer and impute dropouts from scRNA-seq data. Using the expression rate and coefficient of variation of the genes within the cell subpopulation, BayesImpute first determines likely dropouts, and then constructs the posterior distribution for each gene and uses the posterior mean to impute dropout values. Experiments on simulated and real data show that BayesImpute can effectively identify dropout events and reduce the introduction of false positive signals. Additionally, BayesImpute successfully recovers the true expression levels of missing values, restores the gene-to-gene and cell-to-cell correlation coefficients, and maintains the biological information in bulk RNA-seq data. Furthermore, BayesImpute boosts the clustering and visualization of cell subpopulations and improves the identification of differentially expressed genes. We further demonstrate that, in comparison to other statistical-based imputation methods, BayesImpute is scalable and fast with minimal memory usage.
|
9
|
PreOBP_ML: Machine Learning Algorithms for Prediction of Optical Biosensor Parameters. MICROMACHINES 2023; 14:1174. [PMID: 37374757 DOI: 10.3390/mi14061174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Revised: 05/28/2023] [Accepted: 05/29/2023] [Indexed: 06/29/2023]
Abstract
Developing standard optical biosensors requires a time-consuming simulation procedure. Machine learning might be a better solution for reducing that time and effort. Effective indices, core power, total power, and effective area are the most crucial parameters for evaluating optical sensors. In this study, several machine learning (ML) approaches have been applied to predict those parameters while considering the core radius, cladding radius, pitch, analyte, and wavelength as the input vectors. We have utilized least squares (LS), LASSO, Elastic-Net (ENet), and Bayesian ridge regression (BRR) to make a comparative discussion using a balanced dataset obtained with the COMSOL Multiphysics simulation tool. Furthermore, a more extensive analysis of sensitivity, power fraction, and confinement loss is also demonstrated using the predicted and simulated data. The suggested models were also examined in terms of R2-score, mean absolute error (MAE), and mean squared error (MSE), with all of the models having an R2-score of more than 0.99, and it was also shown that optical biosensors had a design error rate of less than 3%. This research might pave the way for machine learning-based optimization approaches to be used to improve optical biosensors.
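The three evaluation metrics named above have standard definitions, shown here as a small dependency-free sketch. The function names and the toy values in the usage note are illustrative assumptions and are not taken from the study's data.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot
```

For example, with simulated values `[1, 2, 3, 4]` and predictions `[1.1, 1.9, 3.2, 3.8]`, the residual sum of squares is 0.1 and the total sum of squares is 5, so the R2-score is 0.98.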
|
10
|
Biomarker Identification via a Factorization Machine-Based Neural Network With Binary Pairwise Encoding. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2136-2146. [PMID: 37018561 DOI: 10.1109/tcbb.2023.3235299] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
Biomolecules such as microRNAs (miRNAs) and long non-coding RNAs (lncRNAs) play critical roles in diverse fundamental and vital biological processes. They can serve as disease biomarkers, as their dysregulation could cause complex human diseases. Identifying those biomarkers is helpful for the diagnosis, treatment, prognosis, and prevention of diseases. In this study, we propose a factorization machine-based deep neural network with binary pairwise encoding, DFMbpe, to identify disease-related biomarkers. First, to comprehensively consider the interdependence of features, a binary pairwise encoding method is designed to obtain the raw feature representations for each biomarker-disease pair. Second, the raw features are mapped into their corresponding embedding vectors. Then, the factorization machine is applied to capture the wide low-order feature interdependence, while the deep neural network is applied to capture the deep high-order feature interdependence. Finally, the two kinds of features are combined to produce the final prediction results. Unlike other biomarker identification models, the binary pairwise encoding considers the interdependence of features even when they never appear in the same sample, and the DFMbpe architecture emphasizes both low-order and high-order feature interactions simultaneously. The experimental results show that DFMbpe greatly outperforms the state-of-the-art identification models in both cross-validation and independent dataset evaluation. In addition, three types of case studies further demonstrate the effectiveness of this model.
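The low-order part of such a model can be illustrated with the standard second-order factorization machine, in which each feature has an embedding vector and every pair of features interacts through the dot product of their embeddings. The sketch below assumes that textbook formulation; the names `fm_pairwise_naive`, `fm_pairwise_fast`, and `fm_predict`, and the parameters `w0`, `w`, and `V`, are illustrative, and DFMbpe's actual layers may differ.

```python
from typing import List, Sequence


def fm_pairwise_naive(x: Sequence[float], V: List[List[float]]) -> float:
    """Sum over i < j of <V[i], V[j]> * x[i] * x[j], computed pair by pair."""
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total


def fm_pairwise_fast(x: Sequence[float], V: List[List[float]]) -> float:
    """Same quantity in O(n * k) via 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2)."""
    k = len(V[0])
    total = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += s * s - s_sq
    return 0.5 * total


def fm_predict(x: Sequence[float], w0: float, w: Sequence[float], V: List[List[float]]) -> float:
    """Bias + linear term + pairwise interaction term."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    return w0 + linear + fm_pairwise_fast(x, V)
```

Because the interaction weight for a feature pair is a dot product of two learned embeddings rather than a free parameter, a pair can receive a nonzero weight even if the two features never co-occur in a training sample, which matches the motivation given in the abstract.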
|
11
|
A Two-Branch Neural Network for Short-Axis PET Image Quality Enhancement. IEEE J Biomed Health Inform 2023; PP. [PMID: 37030746 DOI: 10.1109/jbhi.2023.3260180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/10/2023]
Abstract
The axial field of view (FOV) is a key factor that affects the quality of PET images. Due to hardware FOV restrictions, conventional short-axis PET scanners with FOVs of 20 to 35 cm can acquire only low-quality PET (LQ-PET) images in fast scanning times (2-3 minutes). To overcome hardware restrictions and improve PET image quality for better clinical diagnoses, several deep learning-based algorithms have been proposed. However, these approaches use simple convolution layers with residual learning and local attention, which insufficiently extract and fuse long-range contextual information. To this end, we propose a novel two-branch network architecture with swin transformer units and graph convolution operation, namely SW-GCN. The proposed SW-GCN provides additional spatial- and channel-wise flexibility to handle different types of input information flow. Specifically, considering the high computational cost of calculating self-attention weights in full-size PET images, in our designed spatial adaptive branch, we compute self-attention within each local partition window and introduce global information interactions between nonoverlapping windows through shifting operations, which avoids this cost. In addition, the convolutional network structure tends to consider the information in each channel equally during the feature extraction process. In our designed channel adaptive branch, we use a Watts-Strogatz topology structure to connect each feature map to its most relevant features in each graph convolutional layer, substantially reducing information redundancy. Moreover, ensemble learning is adopted in our SW-GCN for mapping distinct features from the two well-designed branches to the enhanced PET images. We carried out extensive experiments on three single-bed position scans for 386 patients. The test results demonstrate that our proposed SW-GCN approach outperforms state-of-the-art methods in both quantitative and qualitative evaluations.
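As a rough illustration of the topology mentioned for the channel adaptive branch, the sketch below builds a generic Watts-Strogatz small-world graph: a ring lattice in which each node is linked to its `k` nearest neighbors, followed by random rewiring of each edge with probability `p`. The function name and the parameters `n`, `k`, `p`, and `seed` are assumptions; how SW-GCN selects the most relevant feature maps is not described in the abstract and is not modeled here.

```python
import random
from typing import List, Set


def watts_strogatz(n: int, k: int, p: float, seed: int = 0) -> List[Set[int]]:
    """Return an undirected small-world graph as a list of neighbor sets."""
    if k % 2 != 0 or not 0 < k < n:
        raise ValueError("k must be even and satisfy 0 < k < n")
    rng = random.Random(seed)
    adj: List[Set[int]] = [set() for _ in range(n)]
    # Step 1: ring lattice, each node linked to k/2 neighbors on each side.
    for i in range(n):
        for d in range(1, k // 2 + 1):
            j = (i + d) % n
            adj[i].add(j)
            adj[j].add(i)
    # Step 2: rewire each lattice edge (i, i+d) with probability p,
    # avoiding self-loops and duplicate edges so the edge count is preserved.
    for d in range(1, k // 2 + 1):
        for i in range(n):
            j = (i + d) % n
            if j in adj[i] and rng.random() < p:
                candidates = [t for t in range(n) if t != i and t not in adj[i]]
                if candidates:
                    t = rng.choice(candidates)
                    adj[i].discard(j)
                    adj[j].discard(i)
                    adj[i].add(t)
                    adj[t].add(i)
    return adj
```

In a graph convolutional layer, each node would stand for one feature map (channel), and the neighbor sets would define which channels exchange information; with `p = 0` the result is the plain ring lattice in which every node has exactly `k` neighbors.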
|
12
|
Multi-level GAN based enhanced CT scans for liver cancer diagnosis. Biomed Signal Process Control 2023. [DOI: 10.1016/j.bspc.2022.104450] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
|
13
|
NMTF-DTI: A Nonnegative Matrix Tri-factorization Approach With Multiple Kernel Fusion for Drug-Target Interaction Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:586-594. [PMID: 34914594 DOI: 10.1109/tcbb.2021.3135978] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
Prediction of drug-target interactions (DTIs) plays a significant role in drug development and drug discovery. Although this task requires a large investment in terms of time and cost, especially when it is performed experimentally, the results are not necessarily significant. Computational DTI prediction is a shortcut to reduce the risks of experimental methods. In this study, we propose an effective approach of nonnegative matrix tri-factorization, referred to as NMTF-DTI, to predict the interaction scores between drugs and targets. NMTF-DTI utilizes multiple kernels (similarity measures) for drugs and targets and Laplacian regularization to boost the prediction performance. The performance of NMTF-DTI is evaluated via cross-validation and is compared with existing DTI prediction methods in terms of the area under the receiver operating characteristic (ROC) curve (AUC) and the area under the precision and recall curve (AUPR). We evaluate our method on four gold standard datasets, comparing to other state-of-the-art methods. Cross-validation and a separate, manually created dataset are used to set parameters. The results show that NMTF-DTI outperforms other competing methods. Moreover, the results of a case study also confirm the superiority of NMTF-DTI.
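The two evaluation measures named above can be computed from predicted interaction scores and binary labels as in the following sketch. It uses the pairwise (Mann-Whitney) form of AUC and average precision as a step-wise estimate of AUPR; the function names are illustrative, and the study's exact evaluation protocol (for example, tie handling or interpolation) is not stated in the abstract.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@rank taken at the rank of each positive example."""
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos
```

AUPR is often reported alongside AUC for interaction prediction because known interactions are typically far rarer than unknown pairs, and precision-recall measures are more sensitive to performance on the rare positive class.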
|
14
|
DeepCellEss: cell line-specific essential protein prediction with attention-based interpretable deep learning. Bioinformatics 2022; 39:6865030. [PMID: 36458923 PMCID: PMC9825760 DOI: 10.1093/bioinformatics/btac779] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/17/2022] [Revised: 11/25/2022] [Accepted: 12/01/2022] [Indexed: 12/05/2022] Open
Abstract
MOTIVATION Protein essentiality is usually accepted to be a conditional trait and strongly affected by cellular environments. However, existing computational methods often do not take such characteristics into account, preferring to incorporate all available data and train a general model for all cell lines. In addition, the lack of model interpretability limits further exploration and analysis of essential protein predictions. RESULTS In this study, we proposed DeepCellEss, a sequence-based interpretable deep learning framework for cell line-specific essential protein predictions. DeepCellEss utilizes a convolutional neural network and bidirectional long short-term memory to learn short- and long-range latent information from protein sequences. Further, a multi-head self-attention mechanism is used to provide residue-level model interpretability. For model construction, we collected extremely large-scale benchmark datasets across 323 cell lines. Extensive computational experiments demonstrate that DeepCellEss yields effective prediction performance for different cell lines and outperforms existing sequence-based methods as well as network-based centrality measures. Finally, we conducted some case studies to illustrate the necessity of considering specific cell lines and the superiority of DeepCellEss. We believe that DeepCellEss can serve as a useful tool for predicting essential proteins across different cell lines. AVAILABILITY AND IMPLEMENTATION The DeepCellEss web server is available at http://csuligroup.com:8000/DeepCellEss. The source code and data underlying this study can be obtained from https://github.com/CSUBioGroup/DeepCellEss. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
15
|
Corrigendum to “Deep learning for brain disorder diagnosis based on fMRI images” [Neurocomputing 469 (2022) 332–345]. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.08.074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
|
16
|
Abstract
Computational drug repositioning aims to identify potential applications of existing drugs for the treatment of diseases for which they were not designed. This approach can considerably accelerate the traditional drug discovery process by decreasing the required time and costs of drug development. Tensor decomposition enables us to integrate multiple drug- and disease-related data to boost the performance of prediction. In this study, a nonnegative tensor decomposition for drug repositioning, NTD-DR, is proposed. In order to capture the hidden information in drug-target, drug-disease, and target-disease networks, NTD-DR uses these pairwise associations to construct a three-dimensional tensor representing drug-target-disease triplet associations and integrates them with similarity information of drugs, targets, and disease to make a prediction. We compare NTD-DR with recent state-of-the-art methods in terms of the area under the receiver operating characteristic (ROC) curve (AUC) and the area under the precision and recall curve (AUPR) and find that our method outperforms competing methods. Moreover, case studies with five diseases also confirm the reliability of predictions made by NTD-DR. Our proposed method identifies more known associations among the top 50 predictions than other methods. In addition, novel associations identified by NTD-DR are validated by literature analyses.
|
17
|
A Dual Ranking Algorithm Based on the Multiplex Network for Heterogeneous Complex Disease Analysis. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1993-2002. [PMID: 33577455 DOI: 10.1109/tcbb.2021.3059046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Identifying biomarkers of heterogeneous complex diseases has always been one of the focuses of medical research. In previous studies, powerful network propagation methods have been applied to find marker genes related to specific diseases, but existing methods are mostly based on a single network, which may be greatly affected by the incompleteness of the network and by the neglect of a large amount of information about physical and functional interactions between biological components. Other methods that directly integrate multiple types of interactions into an aggregate network run the risk that different types of data may conflict with each other and that the characteristics and topologies of each individual network are lost. Meanwhile, biomarkers used in clinical trials should be few in number and have strong discriminative ability. In this study, we developed a multiplex network-based dual ranking framework (DualRank) for heterogeneous complex disease analysis. We applied the proposed method to heterogeneous complex diseases for diagnosis, prognosis, and classification. The results showed that DualRank outperformed competing methods and could identify a small number of biomarkers with strong prediction performance (average AUC = 0.818) and biological interpretability.
|
18
|
Drug-Target Interaction Prediction Using Multi-Head Self-Attention and Graph Attention Network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2208-2218. [PMID: 33956632 DOI: 10.1109/tcbb.2021.3077905] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Indexed: 06/12/2023]
Abstract
Identifying drug-target interactions (DTIs) is an important step in the process of new drug discovery and drug repositioning. Accurate predictions of DTIs can improve the efficiency of drug discovery and development. Although rapid advances in deep learning technologies have generated various computational methods, it is still appealing to further investigate how to design efficient networks for predicting DTIs. In this study, we propose an end-to-end deep learning method (called MHSADTI) to predict DTIs based on the graph attention network and multi-head self-attention mechanism. First, the characteristics of drugs and proteins are extracted by the graph attention network and multi-head self-attention mechanism, respectively. Then, the attention scores are used to determine which amino acid subsequences in a protein are more important for predicting its interactions with the drug. Finally, we predict DTIs by a fully connected layer after obtaining the feature vectors of drugs and proteins. MHSADTI takes advantage of the self-attention mechanism to capture long-range contextual relationships in amino acid sequences and to make DTI predictions interpretable. More effective molecular characteristics are also obtained by the attention mechanism in graph attention networks. Multiple cross-validation experiments are adopted to assess the performance of MHSADTI. The experiments on four datasets (human, C. elegans, DUD-E, and DrugBank) show that our method outperforms the state-of-the-art methods in terms of AUC, Precision, Recall, AUPR, and F1-score. In addition, the case studies further demonstrate that our method can provide effective visualizations to interpret the prediction results from biological insights.
|
19
|
CircR2Disease v2.0: An Updated Web Server for Experimentally Validated circRNA-disease Associations and Its Application. GENOMICS, PROTEOMICS & BIOINFORMATICS 2022; 20:435-445. [PMID: 34856391 PMCID: PMC9801044 DOI: 10.1016/j.gpb.2021.10.002] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 10/31/2020] [Revised: 10/24/2021] [Accepted: 11/24/2021] [Indexed: 01/26/2023]
Abstract
With accumulating dysregulated circular RNAs (circRNAs) in pathological processes, the regulatory functions of circRNAs, especially circRNAs as microRNA (miRNA) sponges and their interactions with RNA-binding proteins (RBPs), have been widely validated. However, the collected information on experimentally validated circRNA-disease associations is only preliminary. Therefore, an updated CircR2Disease database providing a comprehensive resource and web tool to clarify the relationships between circRNAs and diseases in diverse species is necessary. Here, we present an updated CircR2Disease v2.0 with the increased number of circRNA-disease associations and novel characteristics. CircR2Disease v2.0 provides more than 5-fold experimentally validated circRNA-disease associations compared to its previous version. This version includes 4201 entries between 3077 circRNAs and 312 disease subtypes. Secondly, the information of circRNA-miRNA, circRNA-miRNA-target, and circRNA-RBP interactions has been manually collected for various diseases. Thirdly, the gene symbols of circRNAs and disease name IDs can be linked with various nomenclature databases. Detailed descriptions such as samples and journals have also been integrated into the updated version. Thus, CircR2Disease v2.0 can serve as a platform for users to systematically investigate the roles of dysregulated circRNAs in various diseases and further explore the posttranscriptional regulatory function in diseases. Finally, we propose a computational method named circDis based on the graph convolutional network (GCN) and gradient boosting decision tree (GBDT) to illustrate the applications of the CircR2Disease v2.0 database. CircR2Disease v2.0 is available at http://bioinfo.snnu.edu.cn/CircR2Disease_v2.0 and https://github.com/bioinforlab/CircR2Disease-v2.0.
|
20
|
Drug Repositioning with GraphSAGE and Clustering Constraints Based on Drug and Disease Networks. Front Pharmacol 2022; 13:872785. [PMID: 35620297 PMCID: PMC9127467 DOI: 10.3389/fphar.2022.872785] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2022] [Accepted: 04/11/2022] [Indexed: 11/29/2022] Open
Abstract
The understanding of therapeutic properties is important in drug repositioning and drug discovery. However, chemical or clinical trials are an expensive and inefficient way to characterize the therapeutic properties of drugs. Recently, artificial intelligence (AI)-assisted algorithms have received extensive attention for discovering the potential therapeutic properties of drugs and speeding up drug development. In this study, we propose a new method based on GraphSAGE and clustering constraints (DRGCC) to investigate the potential therapeutic properties of drugs for drug repositioning. First, the drug structure features and disease symptom features are extracted. Second, the drug–drug interaction network and disease similarity network are constructed according to the drug–gene and disease–gene relationships. Matrix factorization is adopted to extract the clustering features of the networks. Then, all the features are fed to GraphSAGE to predict new associations between existing drugs and diseases. Benchmark comparisons on two different datasets show that our method has reliable predictive performance and outperforms six other competing methods. We have also conducted case studies on existing drugs and diseases and aimed to predict drugs that may be effective for the novel coronavirus disease 2019 (COVID-19). Among the predicted anti-COVID-19 drug candidates, some drugs are being clinically studied by pharmacologists, and their binding sites to COVID-19-related protein receptors have been found via molecular docking technology.
|
21
|
An Ensemble Hybrid Feature Selection Method for Neuropsychiatric Disorder Classification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1459-1471. [PMID: 33471766 DOI: 10.1109/tcbb.2021.3053181] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Magnetic resonance imaging (MRI) is providing increased access to data on neuropsychiatric disorders that can be made available for advanced data analysis. However, relying on a single type of data limits the ability of psychiatrists to distinguish the subclasses of these disorders. In this paper, we propose an ensemble hybrid feature selection method for neuropsychiatric disorder classification. The method consists of a 3D DenseNet and an XGBoost, which are used to select the image features from structural MRI images and the phenotypic features from phenotypic records, respectively. The hybrid feature is composed of image features and phenotypic features. The proposed method is validated on the Consortium for Neuropsychiatric Phenomics (CNP) dataset, where samples are classified into one of four classes (healthy controls (HC), attention deficit hyperactivity disorder (ADHD), bipolar disorder (BD), and schizophrenia (SD)). Experimental results show that the hybrid feature can improve the performance of classification methods. The best accuracy of binary and multi-class classification reaches 91.22 and 78.62 percent, respectively. We analyze the importance of phenotypic features and image features in different classification tasks. The importance of the structural MRI images is highlighted by incorporating phenotypic features with image features to generate hybrid features. We also visualize the features of three neuropsychiatric disorders and analyze their locations in the brain region.
|
22
|
DPCMNE: Detecting Protein Complexes From Protein-Protein Interaction Networks Via Multi-Level Network Embedding. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1592-1602. [PMID: 33417563 DOI: 10.1109/tcbb.2021.3050102] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Biological functions of a cell are typically carried out through protein complexes. The detection of protein complexes is therefore of great significance for understanding cellular organization and protein functions. In the past decades, many computational methods have been proposed to detect protein complexes. However, most of the existing methods search only the local topological information to mine dense subgraphs as protein complexes, ignoring the global topological information. To tackle this issue, we propose the DPCMNE method to detect protein complexes via multi-level network embedding. It can preserve both the local and global topological information of biological networks. First, DPCMNE employs a hierarchical compressing strategy to recursively compress the input protein-protein interaction (PPI) network into multi-level smaller PPI networks. Then, a network embedding method is applied to these smaller PPI networks to learn protein embeddings at different levels of granularity. The embeddings learned from all the compressed PPI networks are concatenated to represent the final protein embeddings of the original input PPI network. Finally, a core-attachment based strategy is adopted to detect protein complexes in the weighted PPI network constructed from the pairwise similarity of protein embeddings. To assess the effectiveness of our proposed method, DPCMNE is compared with eight other clustering algorithms on two yeast datasets. The experimental results show that DPCMNE outperforms those state-of-the-art complex detection methods in terms of F1 and F1+Acc. Furthermore, the results of functional enrichment analysis indicate that protein complexes detected by DPCMNE are more biologically significant in terms of P-score.
|
23
|
Deep learning for aging research with DNA methylation. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220428140637] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Deep learning is burgeoning in various scientific domains from natural language processing [1], computer vision [2], to bioinformatics [3]. However, its development in DNA methylation (DNAm) clocks is at an early stage with a few studies of DNAm clocks which are based on deep learning. In this perspective, we first overview the evolution of DNAm clocks, then introduce some relevant advancements in deep learning, and finally discuss promising directions which may help address the current issues in the existing DNAm clocks.
|
24
|
PDMDA: predicting deep-level miRNA-disease associations with graph neural networks and sequence features. Bioinformatics 2022; 38:2226-2234. [PMID: 35150255 DOI: 10.1093/bioinformatics/btac077] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2021] [Revised: 01/18/2022] [Accepted: 02/05/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Many studies have shown that microRNAs (miRNAs) play a key role in human diseases. Meanwhile, traditional experimental methods for miRNA-disease association identification are extremely costly, time-consuming and challenging. Therefore, many computational methods have been developed to predict potential associations between miRNAs and diseases. However, those methods mainly predict the existence of miRNA-disease associations, and they cannot predict the deep-level miRNA-disease association types. RESULTS In this study, we propose a new end-to-end deep learning method (called PDMDA) to predict deep-level miRNA-disease associations with graph neural networks (GNNs) and miRNA sequence features. Based on the sequence and structural features of miRNAs, PDMDA extracts the miRNA feature representations by a fully connected network (FCN). The disease feature representations are extracted from the disease-gene network and gene-gene interaction network by a GNN model. Finally, a multilayer with three fully connected layers and a softmax layer is designed to predict the final miRNA-disease association scores based on the concatenated feature representations of miRNAs and diseases. Note that PDMDA does not take the miRNA-disease association matrix as input to compute the Gaussian interaction profile similarity. We conduct three experiments based on six association type samples (including circulations, epigenetics, target, genetics, known associations whose types are unknown and unknown association samples). We conduct fivefold cross-validation to assess the prediction performance of PDMDA. The area under the receiver operating characteristic curve is used as the metric. The experimental results show that PDMDA can accurately predict the deep-level miRNA-disease associations. AVAILABILITY AND IMPLEMENTATION Data and source codes are available at https://github.com/27167199/PDMDA.
|
25
|
MLRDFM: a multi-view Laplacian regularized DeepFM model for predicting miRNA-disease associations. Brief Bioinform 2022; 23:6552270. [PMID: 35323901 DOI: 10.1093/bib/bbac079] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2021] [Revised: 02/07/2022] [Accepted: 02/15/2022] [Indexed: 01/20/2023] Open
Abstract
MOTIVATION MicroRNAs (miRNAs), as critical regulators, are involved in various fundamental and vital biological processes, and their abnormalities are closely related to human diseases. Predicting disease-related miRNAs is beneficial to uncovering new biomarkers for the prevention, detection, prognosis, diagnosis and treatment of complex diseases. RESULTS In this study, we propose a multi-view Laplacian regularized deep factorization machine (DeepFM) model, MLRDFM, to predict novel miRNA-disease associations while improving the standard DeepFM. Specifically, MLRDFM improves DeepFM from two aspects: first, MLRDFM takes the relationships among items into consideration by regularizing their embedding features via their similarity-based Laplacians. In this study, miRNA Laplacian regularization integrates four types of miRNA similarity, while disease Laplacian regularization integrates two types of disease similarity. Second, to judiciously train our model, Laplacian eigenmaps are utilized to initialize the weights in the dense embedding layer. The experimental results on the latest HMDD v3.2 dataset show that MLRDFM improves the performance and reduces the overfitting phenomenon of DeepFM. Besides, MLRDFM is greatly superior to the state-of-the-art models in miRNA-disease association prediction in terms of different evaluation metrics with the 5-fold cross-validation. Furthermore, case studies further demonstrate the effectiveness of MLRDFM.
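As an illustration of the Laplacian regularization idea mentioned in this abstract, the following minimal sketch computes the penalty tr(EᵀLE) for a set of embedding vectors E and a symmetric similarity matrix S, with L = D − S, and checks that it equals the pairwise form ½ Σᵢⱼ Sᵢⱼ‖eᵢ − eⱼ‖². It is not the MLRDFM implementation: the function names and the toy matrices are hypothetical, and it uses a single similarity matrix rather than the multi-view integration the abstract describes.

```python
def laplacian(similarity):
    """Graph Laplacian L = D - S for a symmetric similarity matrix S."""
    n = len(similarity)
    return [
        [(sum(similarity[i]) if i == j else 0.0) - similarity[i][j] for j in range(n)]
        for i in range(n)
    ]


def laplacian_penalty(embeddings, similarity):
    """Regularizer tr(E^T L E) = sum_ij L_ij * <e_i, e_j>."""
    lap = laplacian(similarity)
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            total += lap[i][j] * dot
    return total


def pairwise_penalty(embeddings, similarity):
    """Equivalent form: 0.5 * sum_ij S_ij * ||e_i - e_j||^2."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            total += similarity[i][j] * sq_dist
    return 0.5 * total


# Hypothetical toy data: three items, two-dimensional embeddings.
S = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
E = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(laplacian_penalty(E, S), pairwise_penalty(E, S))
```

The pairwise form shows why the penalty works as a regularizer: items with high similarity are pushed toward nearby embeddings, while dissimilar items contribute little.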
|
26
|
HyMM: hybrid method for disease-gene prediction by integrating multiscale module structure. Brief Bioinform 2022; 23:6547263. [PMID: 35275996 DOI: 10.1093/bib/bbac072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2021] [Revised: 01/18/2022] [Accepted: 02/13/2022] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Identifying disease-related genes is an important issue in computational biology. Module structure widely exists in biomolecule networks, and complex diseases are usually thought to be caused by perturbations of local neighborhoods in the networks, which can provide useful insights for the study of disease-related genes. However, the mining and effective utilization of the module structure is still challenging in problems such as disease-gene prediction. RESULTS We propose a hybrid disease-gene prediction method integrating multiscale module structure (HyMM), which can utilize multiscale information from local to global structure to more effectively predict disease-related genes. HyMM extracts module partitions from local to global scales by multiscale modularity optimization with exponential sampling, and estimates the disease relatedness of genes in partitions by the abundance of disease-related genes within modules. Then, a probabilistic model for integration of gene rankings is designed in order to integrate multiple predictions derived from multiscale module partitions and network propagation, and a parameter estimation strategy based on functional information is proposed to further enhance HyMM's predictive power. Through a series of experiments, we reveal the importance of module partitions at different scales, and verify the stable and good performance of HyMM compared with eight other state-of-the-art methods and its further performance improvement derived from the parameter estimation. CONCLUSIONS The results confirm that HyMM is an effective framework for integrating multiscale module structure to enhance the ability to predict disease-related genes, which may provide useful insights for the study of the multiscale module structure and its application in problems such as disease-gene prediction.
|
27
|
NIMCE: A Gene Regulatory Network Inference Approach Based on Multi Time Delays Causal Entropy. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1042-1049. [PMID: 33035155 DOI: 10.1109/tcbb.2020.3029846] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Gene regulatory networks (GRNs) are involved in various biological processes, such as cell cycle, differentiation and apoptosis. The existing large amount of expression data, especially the time-series expression data, provides a chance to infer GRNs by computational methods. These data can reveal the dynamics of gene expression and imply the regulatory relationships among genes. However, identifying the indirect regulatory links is still a big challenge, as most studies treat time points as independent observations while ignoring the influences of time delays. In this study, we propose a GRN inference method based on an information-theoretic measure, called NIMCE. NIMCE incorporates the transfer entropy to measure the regulatory links between each pair of genes, then applies the causation entropy to filter indirect relationships. In addition, NIMCE applies multiple time delays to identify indirect regulatory relationships from candidate genes. Experiments on simulated and colorectal cancer data show that NIMCE outperforms other competing methods. All data and codes used in this study are publicly available at https://github.com/CSUBioGroup/NIMCE.
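To make the role of time delays concrete, the sketch below estimates a plug-in transfer entropy from a source series to a target series at a chosen delay, using simple counts over discrete symbols. It is only an illustration of delayed transfer entropy, not the NIMCE estimator (see the repository above for the authors' code); the function name, the binary toy series and the history length of one step are assumptions made for this example.

```python
import math
import random
from collections import Counter


def transfer_entropy(source, target, delay=1):
    """Plug-in estimate (in nats) of the transfer entropy from source to target.

    Computes sum p(y_t, y_{t-1}, x_{t-delay}) *
    log[ p(y_t | y_{t-1}, x_{t-delay}) / p(y_t | y_{t-1}) ] over discrete symbols.
    """
    if delay < 1:
        raise ValueError("delay must be at least 1")
    triples = Counter()
    for t in range(delay, len(target)):
        triples[(target[t], target[t - 1], source[t - delay])] += 1
    total = sum(triples.values())

    # Marginal counts needed for the two conditional probabilities.
    now_prev = Counter()   # (y_t, y_{t-1})
    prev_src = Counter()   # (y_{t-1}, x_{t-delay})
    prev = Counter()       # y_{t-1}
    for (y, y_prev, x), count in triples.items():
        now_prev[(y, y_prev)] += count
        prev_src[(y_prev, x)] += count
        prev[y_prev] += count

    te = 0.0
    for (y, y_prev, x), count in triples.items():
        ratio = (count * prev[y_prev]) / (prev_src[(y_prev, x)] * now_prev[(y, y_prev)])
        te += (count / total) * math.log(ratio)
    return te


# Hypothetical toy series: the target copies the source with a lag of two steps.
rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(500)]
y = [0, 0] + x[:-2]
for d in (1, 2, 3):
    print(d, round(transfer_entropy(x, y, delay=d), 3))
```

In this toy setting the estimate is largest at the delay that matches the true lag, which is the intuition behind scanning several delays rather than treating time points as independent observations.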
|
28
|
Identifying Gene Signatures for Cancer Drug Repositioning Based on Sample Clustering. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:953-965. [PMID: 32845842 DOI: 10.1109/tcbb.2020.3019781] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Drug repositioning is an important approach for drug discovery. Computational drug repositioning approaches typically use a gene signature to represent a particular disease and connect the gene signature with drug perturbation profiles. Although disease samples, especially from cancer, may be heterogeneous, most existing methods consider them as a homogeneous set to identify differentially expressed genes (DEGs) for further determining a gene signature. As a result, some genes that should be in a gene signature may be averaged out. In this study, we propose a new framework to identify gene signatures for cancer drug repositioning based on sample clustering (GS4CDRSC). GS4CDRSC first groups samples into several clusters based on their gene expression profiles. Second, an existing method is applied to the samples in each cluster to generate a list of DEGs. Then a weighting approach is used to identify an integrated gene signature from all the lists of DEGs. The integrated gene signature is used to connect with drug perturbation profiles in the Connectivity Map (CMap) database to generate a list of drug candidates. GS4CDRSC has been tested with several cancer datasets and existing methods. The computational results show that GS4CDRSC outperforms those methods without the sample clustering and weighting approaches in terms of both number and rate of predicted known drugs for specific cancers.
|
29
|
Biomedical data, computational methods and tools for evaluating disease-disease associations. Brief Bioinform 2022; 23:6522999. [PMID: 35136949 DOI: 10.1093/bib/bbac006] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2021] [Revised: 01/04/2022] [Accepted: 01/05/2022] [Indexed: 12/12/2022] Open
Abstract
In recent decades, exploring potential relationships between diseases has been an active research field. With the rapid accumulation of disease-related biomedical data, a lot of computational methods and tools/platforms have been developed to reveal intrinsic relationship between diseases, which can provide useful insights to the study of complex diseases, e.g. understanding molecular mechanisms of diseases and discovering new treatment of diseases. Human complex diseases involve both external phenotypic abnormalities and complex internal molecular mechanisms in organisms. Computational methods with different types of biomedical data from phenotype to genotype can evaluate disease-disease associations at different levels, providing a comprehensive perspective for understanding diseases. In this review, available biomedical data and databases for evaluating disease-disease associations are first summarized. Then, existing computational methods for disease-disease associations are reviewed and classified into five groups in terms of the usages of biomedical data, including disease semantic-based, phenotype-based, function-based, representation learning-based and text mining-based methods. Further, we summarize software tools/platforms for computation and analysis of disease-disease associations. Finally, we give a discussion and summary on the research of disease-disease associations. This review provides a systematic overview for current disease association research, which could promote the development and applications of computational methods and tools/platforms for disease-disease associations.
|
31
|
Predicting Drug-Drug Interactions Based on Integrated Similarity and Semi-Supervised Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:168-179. [PMID: 32310779 DOI: 10.1109/tcbb.2020.2988018] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
A drug-drug interaction (DDI) is defined as an association between two drugs where the pharmacological effects of a drug are influenced by another drug. Positive DDIs can usually improve the therapeutic effects for patients, but negative DDIs are a major cause of adverse drug reactions and can even result in drug withdrawal from the market and patient death. Therefore, identifying DDIs has become a key component of drug development and disease treatment. In this study, we propose a novel method to predict DDIs based on integrated similarity and semi-supervised learning (DDI-IS-SL). DDI-IS-SL integrates the drug chemical, biological and phenotype data to calculate the feature similarity of drugs with the cosine similarity method. The Gaussian Interaction Profile kernel similarity of drugs is also calculated based on known DDIs. A semi-supervised learning method (the Regularized Least Squares classifier) is used to calculate the interaction possibility scores of drug-drug pairs. In terms of 5-fold cross validation, 10-fold cross validation and de novo drug validation, DDI-IS-SL achieves better prediction performance than other comparative methods. In addition, the average computation time of DDI-IS-SL is shorter than that of other comparative methods. Finally, case studies further demonstrate the performance of DDI-IS-SL in practical applications.
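The two similarity measures named in this abstract can be sketched as follows. This is an illustrative example, not the DDI-IS-SL code: the function names and toy interaction profiles are hypothetical, and the bandwidth normalization (dividing by the mean squared norm of the interaction profiles) follows the commonly used Gaussian Interaction Profile convention, which is assumed here rather than taken from the abstract. How the method integrates the two similarities and applies the Regularized Least Squares classifier is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian Interaction Profile kernel K_ij = exp(-gamma * ||y_i - y_j||^2).

    `profiles` holds one row per drug: its binary vector of known interactions.
    gamma is gamma_prime divided by the mean squared norm of the rows
    (assumed convention for this sketch).
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in row) for row in profiles) / n
    if mean_sq_norm == 0.0:
        # No known interactions at all: only self-similarity is defined.
        return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    gamma = gamma_prime / mean_sq_norm
    return [
        [
            math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j])))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical toy example: drug 0 interacts with drugs 1 and 2.
profiles = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
for row in gip_kernel(profiles):
    print([round(value, 3) for value in row])
```

Drugs 1 and 2 share the same interaction profile, so their kernel similarity is 1 even though neither is known to interact with the other; this is what lets known DDIs inform predictions for unobserved pairs.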
|
34
|
An integrated brain-specific network identifies genes associated with neuropathologic and clinical traits of Alzheimer’s disease. Brief Bioinform 2021; 23:6483067. [DOI: 10.1093/bib/bbab522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2021] [Revised: 10/26/2021] [Accepted: 11/13/2021] [Indexed: 11/12/2022] Open
Abstract
Alzheimer’s disease (AD) has a strong genetic predisposition. However, its risk genes remain incompletely identified. We developed an Alzheimer’s brain gene network-based approach to predict AD-associated genes by leveraging the functional pattern of known AD-associated genes. Our constructed network outperformed existing networks in predicting AD genes. We then systematically validated the predictions using independent genetic, transcriptomic, proteomic, neuropathological and clinical data. First, top-ranked genes were enriched in AD-associated pathways. Second, using external gene expression data from the Mount Sinai Brain Bank study, we found that the top-ranked genes were significantly associated with neuropathological and clinical traits, including the Consortium to Establish a Registry for Alzheimer’s Disease score, Braak stage score and clinical dementia rating. The analysis of Alzheimer’s brain single-cell RNA-seq data revealed cell-type-specific association of predicted genes with early pathology of AD. Third, by interrogating proteomic data from the Religious Orders Study and Memory and Aging Project and the Baltimore Longitudinal Study of Aging, we observed a significant association of protein expression level with cognitive function and AD clinical severity. The network, method and predictions could become a valuable resource to advance the identification of risk genes for AD.
|
35
|
Predicting drug-drug interactions by graph convolutional network with multi-kernel. Brief Bioinform 2021; 23:6447677. [PMID: 34864856 DOI: 10.1093/bib/bbab511] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2021] [Revised: 10/28/2021] [Accepted: 11/07/2021] [Indexed: 11/14/2022] Open
Abstract
Drug repositioning is proposed to find novel usages for existing drugs. Among many types of drug repositioning approaches, predicting drug-drug interactions (DDIs) helps explore the pharmacological functions of drugs and identifies potential drugs for novel treatments. A number of models have been applied to predict DDIs. The DDI network, which is constructed from the known DDIs, is a common part of many of the existing methods. However, the functions of DDIs are different, and thus integrating them in a single DDI graph may overlook some useful information. We propose a graph convolutional network with multi-kernel (GCNMK) to predict potential DDIs. GCNMK adopts two DDI graph kernels for the graph convolutional layers, namely, an increased DDI graph consisting of 'increase'-related DDIs and a decreased DDI graph consisting of 'decrease'-related DDIs. The learned drug features are fed into a block with three fully connected layers for the DDI prediction. We compare various types of drug features and find that the target feature of drugs outperforms all other types of features and their concatenated features. In comparison with three different DDI prediction methods, our proposed GCNMK achieves the best performance in terms of area under receiver operating characteristic curve and area under precision-recall curve. In case studies, we identify the top 20 potential DDIs from all unknown DDIs, and the top 10 potential DDIs from the unknown DDIs among breast, colorectal and lung neoplasms-related drugs. Most of them have evidence to support the existence of their interactions.
|
36
|
Machine learning and deep learning strategies in drug repositioning. Curr Bioinform 2021. [DOI: 10.2174/1574893616666211119093100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Drug repositioning aims to find novel usages for existing drugs. It plays an important role in drug discovery, especially in the pre-clinical stages. Compared with the traditional drug discovery approaches, computational approaches can save time and reduce cost significantly. Since drug repositioning relies on existing drug-, disease-, and target-centric data, many machine learning (ML) approaches have been proposed to identify useful information from multiple data resources. Deep learning (DL) is a subset of ML and appeared in drug repositioning much later than basic ML. Nevertheless, DL methods have shown great performance in predicting potential drugs in many studies. In this article, we review the commonly used basic ML and DL approaches in drug repositioning. Firstly, the related databases are introduced, all of which are publicly available for researchers. Two types of pre-processing steps, calculating similarities and constructing networks based on those data, are discussed. Secondly, the basic ML and DL strategies are illustrated separately. Thirdly, we review the latest studies about the applications of basic ML and DL in identifying potential drugs through three paths: drug-disease associations, drug-drug interactions, and drug-target interactions. Finally, we discuss the limitations in current studies and suggest several directions of future work to address those limitations.
|
37
|
Prognosticating Outcome in Pancreatic Head Cancer With the use of a Machine Learning Algorithm. Technol Cancer Res Treat 2021; 20:15330338211050767. [PMID: 34738844 PMCID: PMC8573477 DOI: 10.1177/15330338211050767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Background: The purpose of this project is to identify prognostic features in resectable pancreatic head adenocarcinoma and use these features to develop a machine learning algorithm that prognosticates survival for patients pursuing pancreaticoduodenectomy. Methods: A retrospective cohort study of 93 patients who underwent a pancreaticoduodenectomy was performed. The patients were analyzed in 2 groups: Group 1 (n = 38) comprised of patients who survived < 2 years, and Group 2 (n = 55) comprised of patients who survived > 2 years. After comparing the two groups, 9 categorical features and 2 continuous features (11 total) were selected to be statistically significant (p < .05) in predicting outcome after surgery. These 11 features were used to train a machine learning algorithm that prognosticates survival. Results: The algorithm obtained 75% accuracy, 41.9% sensitivity, and 97.5% specificity in predicting whether survival is less than 2 years after surgery. Conclusion: A supervised machine learning algorithm that prognosticates survival can be a useful tool to personalize treatment plans for patients with pancreatic cancer.
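For readers comparing the three reported metrics, the sketch below shows how accuracy, sensitivity and specificity are derived from a confusion matrix, with "survival less than 2 years" treated as the positive class. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the study's data, and the function name is an assumption of this example.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn: positive cases (e.g. survival < 2 years) predicted correctly/incorrectly.
    tn/fp: negative cases predicted correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of positives that were detected
        "specificity": tn / (tn + fp),   # share of negatives that were cleared
    }


# Hypothetical counts, not taken from the study.
metrics = classification_metrics(tp=8, fn=2, tn=18, fp=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because accuracy pools both classes, a classifier can combine high specificity with low sensitivity and still report moderate accuracy, which is why the three figures are best read together.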
|
38
|
FUNMarker: Fusion Network-Based Method to Identify Prognostic and Heterogeneous Breast Cancer Biomarkers. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2483-2491. [PMID: 32070993 DOI: 10.1109/tcbb.2020.2973148] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Breast cancer is a heterogeneous disease with many clinically distinguishable molecular subtypes, each corresponding to a cluster of patients. Identification of prognostic and heterogeneous biomarkers for breast cancer aims to detect cluster-specific gene biomarkers which can be used for accurate survival prediction of breast cancer outcomes. In this study, we proposed a FUsion Network-based method (FUNMarker) to identify prognostic and heterogeneous breast cancer biomarkers by considering the heterogeneity of patient samples and biological information from multiple sources. To reduce the effect of heterogeneity of patients, samples were first clustered using the K-means algorithm based on the principal components of gene expression. For each cluster, to comprehensively evaluate the influence of genes on breast cancer, genes were weighted from three aspects: biological function, prognostic ability and correlation with known disease genes. Then they were ranked via a label propagation model on a fusion network that combined physical protein interactions from seven types of networks and thus could reduce the impact of incompleteness of the interactome. We compared FUNMarker with three state-of-the-art methods and the results showed that biomarkers identified by FUNMarker were biologically interpretable and had stronger discriminative power than those of the existing methods in differentiating patients with different prognostic outcomes.
|
39
|
DMFLDA: A Deep Learning Framework for Predicting lncRNA-Disease Associations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2353-2363. [PMID: 32248123 DOI: 10.1109/tcbb.2020.2983958] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Indexed: 06/11/2023]
Abstract
A growing amount of evidence suggests that long non-coding RNAs (lncRNAs) play important roles in the regulation of biological processes in many human diseases. However, the number of experimentally verified lncRNA-disease associations is very limited. Thus, various computational approaches have been proposed to predict lncRNA-disease associations. Current matrix factorization-based methods cannot capture the complex non-linear relationship between lncRNAs and diseases, and traditional machine learning-based methods are not sufficiently powerful to learn the representation of lncRNAs and diseases. Considering these limitations in existing computational methods, we propose a deep matrix factorization model to predict lncRNA-disease associations (DMFLDA for short). DMFLDA uses a cascade of non-linear hidden layers to learn latent representations of lncRNAs and diseases. By using non-linear hidden layers, DMFLDA captures more complex non-linear relationships between lncRNAs and diseases than traditional matrix factorization-based methods. In addition, DMFLDA learns features directly from the lncRNA-disease interaction matrix and thus can learn more accurate representations of lncRNAs and diseases than traditional machine learning methods. The low-dimensional representations of the lncRNAs and diseases are fused to estimate the new interaction value. To evaluate the performance of DMFLDA, we perform leave-one-out cross-validation and 5-fold cross-validation on known experimentally verified lncRNA-disease associations. The experimental results show that DMFLDA performs better than the existing methods. The case studies show that many predicted interactions of colorectal cancer, prostate cancer, and renal cancer have been verified by recent biomedical literature. The source code and datasets can be obtained from https://github.com/CSUBioGroup/DMFLDA.
|
40
|
A Deep Learning Framework for Gene Ontology Annotations With Sequence- and Network-Based Information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2208-2217. [PMID: 31985440 DOI: 10.1109/tcbb.2020.2968882] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 06/10/2023]
Abstract
Knowledge of protein functions plays an important role in biology and medicine. With the rapid development of high-throughput technologies, a huge number of proteins have been discovered. However, a great number of proteins remain without functional annotations. A protein usually has multiple functions, and some functions or biological processes require interactions among several proteins. Additionally, Gene Ontology provides a useful classification for protein functions and contains more than 40,000 terms. We propose a deep learning framework called DeepGOA to predict protein functions with protein sequences and protein-protein interaction (PPI) networks. For protein sequences, we extract two types of information: sequence semantic information and subsequence-based features. We use the word2vec technique to numerically represent protein sequences, and utilize a Bi-directional Long Short-Term Memory (Bi-LSTM) network and a multi-scale convolutional neural network (multi-scale CNN) to obtain the global and local semantic features of protein sequences, respectively. Additionally, we use the InterPro tool to scan protein sequences for extracting subsequence-based information, such as domains and motifs. Then, the information is fed into a neural network to generate high-quality features. For the PPI network, the DeepWalk algorithm is applied to generate embeddings of the network. Then the two types of features are concatenated to predict protein functions. To evaluate the performance of DeepGOA, several different evaluation methods and metrics are utilized. The experimental results show that DeepGOA outperforms DeepGO and BLAST.
|
41
|
Abstract
Disease signature-based drug repositioning approaches typically first identify a disease signature from gene expression profiles of disease samples to represent a particular disease. Then such a disease signature is connected with the drug-induced gene expression profiles to find potential drugs for the particular disease. In order to obtain reliable disease signatures, the number of disease samples should be large enough, which is not always the case in practice, especially for personalized medicine. On the other hand, the sample sizes of drug-induced gene expression profiles are generally large. In this study, we propose a new drug repositioning approach (HDgS), in which the drug signature is first identified from drug-induced gene expression profiles, and then connected to the gene expression profiles of disease samples to find the potential drugs for patients. In order to take the dependencies among genes into account, the human protein complexes (HPC) are used to define the drug signature. The proposed HDgS is applied to the drug-induced gene expression profiles in LINCS and several types of cancer samples. The results indicate that the HPC-based drug signature can effectively find drug candidates for patients and that the proposed HDgS can be applied for personalized medicine with even one patient sample.
|
42
|
A sensitive repeat identification framework based on short and long reads. Nucleic Acids Res 2021; 49:e100. [PMID: 34214175 PMCID: PMC8464074 DOI: 10.1093/nar/gkab563] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/07/2020] [Revised: 06/08/2021] [Accepted: 06/18/2021] [Indexed: 12/11/2022] Open
Abstract
Numerous studies have shown that repetitive regions in genomes play indispensable roles in the evolution, inheritance and variation of living organisms. However, most existing methods cannot achieve satisfactory performance in identifying repeats in terms of both accuracy and size, since NGS reads are too short to identify long repeats whereas SMS (Single Molecule Sequencing) long reads have high error rates. In this study, we present a novel identification framework, LongRepMarker, based on global de novo assembly and k-mer based multiple sequence alignment for precisely marking long repeats in genomes. The major characteristics of LongRepMarker are as follows: (i) by introducing barcode linked reads and SMS long reads to assist the assembly of all short paired-end reads, it can identify the repeats to a greater extent; (ii) by finding the overlap sequences between assemblies or chromosomes, it locates the repeats faster and more accurately; (iii) by using the multi-alignment unique k-mers rather than the high frequency k-mers to identify repeats in overlap sequences, it can obtain the repeats more comprehensively and stably; (iv) by applying the parallel alignment model based on the multi-alignment unique k-mers, the efficiency of data processing can be greatly improved; and (v) by adopting the corresponding identification strategies, structural variations that occur between repeats can be identified. Comprehensive experimental results show that LongRepMarker can achieve more satisfactory results than the existing de novo detection methods (https://github.com/BioinformaticsCSU/LongRepMarker).
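As background for point (iii), the sketch below illustrates the conventional high-frequency k-mer idea that the abstract contrasts with its own approach: a k-mer that occurs several times in a sequence hints at a repeat. This is not LongRepMarker's multi-alignment unique k-mer method; the function names, the toy sequence and the threshold `min_count` are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def high_frequency_kmers(seq, k, min_count):
    """Return the k-mers that occur at least min_count times."""
    return {kmer for kmer, n in kmer_counts(seq, k).items() if n >= min_count}


# Toy sequence (hypothetical) in which "ACGT" recurs three times.
toy = "ACGTACGTTTACGT"
repeated = high_frequency_kmers(toy, k=4, min_count=2)
print(sorted(repeated))
```

On real data the choice of k and of the frequency threshold strongly affects which repeats are found, which is one reason frequency-based detection can be unstable.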
|
43
|
DeepLncLoc: a deep learning framework for long non-coding RNA subcellular localization prediction based on subsequence embedding. Brief Bioinform 2021; 23:6366323. [PMID: 34498677 DOI: 10.1093/bib/bbab360] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 06/04/2021] [Revised: 08/04/2021] [Accepted: 08/16/2021] [Indexed: 11/14/2022] Open
Abstract
Long non-coding RNAs (lncRNAs) are a class of RNA molecules with more than 200 nucleotides. A growing amount of evidence reveals that subcellular localization of lncRNAs can provide valuable insights into their biological functions. Existing computational methods for predicting lncRNA subcellular localization use k-mer features to encode lncRNA sequences. However, the sequence order information is lost when only k-mer features are used. We proposed a deep learning framework, DeepLncLoc, to predict lncRNA subcellular localization. In DeepLncLoc, we introduced a new subsequence embedding method that keeps the order information of lncRNA sequences. The subsequence embedding method first divides a sequence into consecutive subsequences, then extracts the patterns of each subsequence, and finally combines these patterns to obtain a complete representation of the lncRNA sequence. After that, a text convolutional neural network is employed to learn high-level features and perform the prediction task. Compared with traditional machine learning models, popular representation methods and existing predictors, DeepLncLoc achieved better performance, which shows that DeepLncLoc could effectively predict lncRNA subcellular localization. Our study not only presented a novel computational model for predicting lncRNA subcellular localization but also introduced a new subsequence embedding method which is expected to be applied in other sequence-based prediction tasks. The DeepLncLoc web server is freely accessible at http://bioinformatics.csu.edu.cn/DeepLncLoc/, and source code and datasets can be downloaded from https://github.com/CSUBioGroup/DeepLncLoc.
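The divide, extract, combine idea described above can be sketched roughly as follows. The abstract does not say how the patterns of each subsequence are extracted, so plain k-mer counts stand in for that step here as an assumption; the function names and parameters are hypothetical and this is not the DeepLncLoc implementation.

```python
from collections import Counter


def split_consecutive(seq, n_parts):
    """Split seq into n_parts consecutive, non-overlapping chunks of near-equal length."""
    size, extra = divmod(len(seq), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)  # first `extra` chunks get one more base
        parts.append(seq[start:end])
        start = end
    return parts


def ordered_kmer_profile(seq, n_parts, k):
    """Return one k-mer count table per chunk, kept in sequence order."""
    return [
        Counter(part[i:i + k] for i in range(len(part) - k + 1))
        for part in split_consecutive(seq, n_parts)
    ]


# Toy sequence: a whole-sequence k-mer count would not reveal that the A-rich
# region precedes the G-rich one, whereas the ordered per-chunk profile does.
profile = ordered_kmer_profile("AAAACCCCGGGG", n_parts=3, k=2)
print(profile)
```

Keeping the per-chunk tables in a list, rather than summing them, is what preserves the order information that whole-sequence k-mer features lose.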
|
44
|
MDAPlatform: A Component-based Platform for Constructing and Assessing miRNA-disease Association Prediction Methods. Curr Bioinform 2021. [DOI: 10.2174/1574893616999210120181506] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/22/2022]
Abstract
Background: Increasing evidence indicates that miRNA-disease association prediction plays a critical role in the study of clinical drugs. Researchers have proposed many computational models for miRNA-disease association prediction. However, there is no unified platform to compare and analyze the pros and cons of these models or to share their code and data.
Objective: In this study, we develop an easy-to-use platform (MDAPlatform) to construct and assess miRNA-disease association prediction methods.
Methods: MDAPlatform integrates the miRNA, disease and miRNA-disease association data used in previous miRNA-disease association prediction studies. Based on a componentized model, it implements the different components of previous computational methods.
Results: Users can conduct cross-validation experiments and compare their methods with other methods; visualized comparison results are also provided.
Conclusion: Based on the componentized model, MDAPlatform provides easy-to-operate interfaces for constructing miRNA-disease association prediction methods, which facilitates the development of new methods in the future.
|
45
|
NetAUC: A network-based multi-biomarker identification method by AUC optimization. Methods 2021; 198:56-64. [PMID: 34364986 DOI: 10.1016/j.ymeth.2021.08.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/15/2021] [Revised: 07/08/2021] [Accepted: 08/03/2021] [Indexed: 10/20/2022] Open
Abstract
Complex diseases are caused by a variety of factors, and their diagnosis, treatment and prognosis are usually difficult. Proteins play an indispensable role in living organisms and perform specific biological functions by interacting with other proteins or biomolecules. Because their dysfunction may lead to diseases, it is natural to mine disease-related biomarkers from the protein-protein interaction network. AUC, the area under the receiver operating characteristic (ROC) curve, is regarded as a gold standard for evaluating the effectiveness of a binary classifier; it measures the classification ability of an algorithm under arbitrary class distributions and misclassification costs. In this study, we have proposed a network-based multi-biomarker identification method by AUC optimization (NetAUC), which integrates gene expression and network information to identify biomarkers for complex disease analysis. The main purpose is to optimize two objectives simultaneously: maximizing AUC and minimizing the number of selected features. We have applied NetAUC to two types of disease analysis: 1) prognosis of breast cancer, 2) classification of similar diseases. The results show that NetAUC can identify a small panel of disease-related biomarkers with strong classification ability and functional interpretability.
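The AUC that NetAUC maximizes has a simple rank-based interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch of that definition follows; the function name and the scores are made up for illustration, and this is not the NetAUC optimization procedure itself.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical classifier scores: three positive and two negative samples.
value = auc([0.9, 0.8, 0.4], [0.7, 0.3])
print(round(value, 4))
```

Because this definition depends only on the ordering of scores, AUC is unaffected by the decision threshold and by the ratio of positive to negative samples.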
|
46
|
scASK: A Novel Ensemble Framework for Classifying Cell Types Based on Single-cell RNA-seq Data. IEEE J Biomed Health Inform 2021; 25:3230-3239. [PMID: 33434139 DOI: 10.1109/jbhi.2021.3050963] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
The Human Cell Atlas (HCA) is a large project that aims to identify all cell types in the human body. Dimension reduction and clustering for identifying cell types from single-cell RNA-sequencing (scRNA-seq) data have become foundational approaches to HCA. The major challenges of current computational analyses are poor performance on large-scale data and sensitivity to initial data. We present a new ensemble framework called Adaptive Slice KNNs (scASK) to address these challenges in analyzing scRNA-seq data with high dimensionality. scASK consists of three innovative modules, called DAS (Data Adaptive Slicing), MCS (Meta Classifiers Selecting) and EMS (Ensemble Mode Switching), which enable scASK to approximate a bias-variance tradeoff beyond classification. Thirteen real scRNA-seq datasets are used to evaluate the performance of scASK. Compared with five popular classification algorithms, our experimental results indicate that scASK achieves the best accuracy and robustness among all competing methods. In conclusion, adaptive slicing is an effective structural reduction procedure, and scASK provides a novel and robust ensemble framework, especially for classifying cell types based on scRNA-seq data. scASK is now publicly available at https://github.com/liubo2358/scASKcmd.
|
47
|
EPGA-SC: A Framework for de novo Assembly of Single-Cell Sequencing Reads. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1492-1503. [PMID: 31603794 DOI: 10.1109/tcbb.2019.2945761] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/10/2023]
Abstract
Assembling genomes from single-cell sequencing data is essential for single-cell studies. However, single-cell assemblies are challenging due to (i) the highly non-uniform read coverage and (ii) the elevated levels of sequencing errors and chimeric reads. Although several assemblers for single-cell data have been proposed in recent years, most of them fail to construct correct long contigs. In this study, we present a new framework called EPGA-SC for de novo assembly of single-cell sequencing reads. The EPGA assembler has designed strategies to solve the problems caused by sequencing errors, sequencing biases, and repetitive regions. However, the extremely unbalanced and richer error types prevent EPGA from achieving high performance on single-cell sequencing data. We therefore designed EPGA-SC based on EPGA. The main innovations of EPGA-SC are as follows: (i) classifying reads to reduce the proportion of false reads; (ii) using multiple sets of high-precision paired-end reads generated from the high-precision assemblies produced by other assemblers such as SPAdes to overcome the impact of sequencing biases and repetitive regions; and (iii) developing novel algorithms for removing chimeric errors and extending contigs. We test EPGA-SC with seven datasets. The experimental results show that EPGA-SC can generate better assemblies than most current tools in most cases in terms of MAX contig, N50, NG50, NA50, and NGA50.
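Two of the metrics named above have compact definitions: N50 is the length of the shortest contig among the longest contigs that together cover at least half of the total assembly length, and NG50 uses half of the reference genome size instead. A minimal sketch follows; the function name `nx50` and the contig lengths are made up for illustration, and NA50 and NGA50, which additionally require alignment to a reference, are not covered.

```python
def nx50(contig_lengths, genome_size=None):
    """Return N50, or NG50 when genome_size is given; None if half is never reached."""
    ordered = sorted(contig_lengths, reverse=True)
    # N50 measures against the assembly itself, NG50 against the reference genome.
    target = (genome_size if genome_size is not None else sum(ordered)) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return None  # empty input, or assembly shorter than half the genome


# Hypothetical contig lengths (total 290).
contigs = [80, 70, 50, 40, 30, 20]
n50 = nx50(contigs)                      # 80 + 70 = 150 >= 145
ng50 = nx50(contigs, genome_size=400)    # 80 + 70 + 50 = 200 >= 200
print(n50, ng50)
```

NG50 makes assemblies of different total lengths comparable, since N50 alone can be inflated by discarding short contigs.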
|
48
|
Abstract
Essential proteins are vital to the survival of organisms and cells. Identification of essential proteins lays a solid foundation for understanding protein functions and discovering drug targets. Traditional biological experiments are expensive and time-consuming. Recently, many computational methods have been proposed. However, noise in the protein-protein interaction (PPI) networks affects the efficiency of essential protein prediction. It is necessary to construct a credible PPI network by using other useful biological information to reduce the effects of this noise. In this article, we proposed a model, Ess-NEXG, to identify essential proteins, which integrates biological information, including orthologous information, subcellular localization information, RNA-Seq information, and the PPI network. In our model, first, we constructed a credible weighted PPI network by using different types of biological information. Second, we extracted the topological features of proteins in the constructed weighted PPI network by using the node2vec technique. Last, we used eXtreme Gradient Boosting (XGBoost) to predict essential proteins by using the topological features of proteins. Extensive results show that our model has better performance than other computational methods.
|
49
|
Predicting miRNA-Disease Associations Based on Multi-View Variational Graph Auto-Encoder with Matrix Factorization. IEEE J Biomed Health Inform 2021; 26:446-457. [PMID: 34111017 DOI: 10.1109/jbhi.2021.3088342] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Indexed: 11/10/2022]
Abstract
MicroRNAs (miRNAs) have been shown to play critical roles in diverse biological processes, including the development of human diseases. Exploring the potential associations between miRNAs and diseases can help us better understand complex disease mechanisms. Given that traditional biological experiments are expensive and time-consuming, computational models can serve as efficient means to uncover potential miRNA-disease associations. This study presents a new computational model based on variational graph auto-encoder with matrix factorization (VGAMF) for miRNA-disease association prediction. More specifically, VGAMF first integrates four different types of information about miRNAs into an miRNA comprehensive similarity network and two types of information about diseases into a disease comprehensive similarity network. Then, VGAMF obtains the non-linear representations of miRNAs and diseases from those two comprehensive similarity networks with variational graph auto-encoders. Simultaneously, a non-negative matrix factorization is conducted on the miRNA-disease association matrix to obtain the linear representations of miRNAs and diseases. Finally, a fully connected neural network combines the linear and non-linear representations of miRNAs and diseases to produce the final predicted association score for all miRNA-disease pairs. In the 10-fold cross-validation experiments, VGAMF achieves an average AUC of 0.9280 on HMDD v2.0 and 0.9470 on HMDD v3.2, outperforming other competing methods. In addition, the case studies on colon cancer and esophageal cancer further demonstrate the effectiveness of VGAMF in predicting novel miRNA-disease associations.
|
50
|
IsoResolve: predicting splice isoform functions by integrating gene and isoform-level features with domain adaptation. Bioinformatics 2021; 37:522-530. [PMID: 32966552 PMCID: PMC8088322 DOI: 10.1093/bioinformatics/btaa829] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/04/2020] [Revised: 08/12/2020] [Accepted: 09/09/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: High-resolution annotation of gene functions is a central goal in functional genomics. A single gene may produce multiple isoforms with different functions through alternative splicing. Conventional approaches, however, consider a gene as a single entity without differentiating these functionally different isoforms. Towards understanding gene functions at higher resolution, recent efforts have focused on predicting the functions of isoforms. However, the performance of existing methods is far from satisfactory mainly because of the lack of isoform-level functional annotation. RESULTS: We present IsoResolve, a novel approach for isoform function prediction, which leverages the information from gene function prediction models with domain adaptation (DA). IsoResolve treats gene-level and isoform-level features as source and target domains, respectively. It uses DA to project the two domains into a latent variable space in such a way that the latent variables from the two domains have similar distributions, which enables the gene domain information to be leveraged for isoform function prediction. We systematically evaluated the performance of IsoResolve in predicting functions. Compared with five state-of-the-art methods, IsoResolve achieved significantly better performance. IsoResolve was further validated by case studies of genes with isoform-level functional annotation. AVAILABILITY AND IMPLEMENTATION: IsoResolve is freely available at https://github.com/genemine/IsoResolve. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
|