1
q-Diffusion leverages the full dimensionality of gene coexpression in single-cell transcriptomics. Commun Biol 2024; 7:400. [PMID: 38565955 DOI: 10.1038/s42003-024-06104-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2023] [Accepted: 03/25/2024] [Indexed: 04/04/2024]
Abstract
Unlocking the full dimensionality of single-cell RNA sequencing data (scRNAseq) is the next frontier to a richer, fuller understanding of cell biology. We introduce q-diffusion, a framework for capturing the coexpression structure of an entire library of genes, improving on state-of-the-art analysis tools. The method is demonstrated via three case studies. In the first, q-diffusion helps gain statistical significance for differential effects on patient outcomes when analyzing the CALGB/SWOG 80405 randomized phase III clinical trial, suggesting precision guidance for the treatment of metastatic colorectal cancer. Secondly, q-diffusion is benchmarked against existing scRNAseq classification methods using an in vitro PBMC dataset, in which the proposed method discriminates IFN-γ stimulation more accurately. The same case study demonstrates improvements in unsupervised cell clustering with the recent Tabula Sapiens human atlas. Finally, a local distributional segmentation approach for spatial scRNAseq, driven by q-diffusion, yields interpretable structures of human cortical tissue.
2
Multi-omics revealed the formation mechanism of characteristic volatiles in Tibetan yak cheese induced by different altitudes. Food Chem X 2024; 21:101120. [PMID: 38292682 PMCID: PMC10825365 DOI: 10.1016/j.fochx.2024.101120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2023] [Revised: 12/26/2023] [Accepted: 01/01/2024] [Indexed: 02/01/2024]
Abstract
The variation in volatiles, bacteria and metabolites of Tibetan yak cheese (TYC) from different altitudes was characterized with multi-omics to reveal how altitude induces the formation of characteristic volatile compounds (C-VOCs) in TYC. Twenty-two C-VOCs (odor activity value, OAV > 1) were identified in TYCs, and several of them, including hexanal, dodecanol, 2,3-butanediol and butyl isobutyrate, were confirmed to be induced by altitude. Bacteria such as Lactobacillus and Kocuria, and metabolites such as benzyl thiocyanate, trehalose and sarcosine, were screened as the variable bacteria and metabolites regulated by altitude in TYCs. Pediococcus and carbohydrates may be the main contributors to the altitude-induced formation of C-VOCs in TYCs. Dodecanol, 2,3-butanediol and hexanal may be derived from sarcosine and EPA, and butyl isobutyrate may originate from 1,6-DP-fructose and threonic acid, facilitated by Pediococcus. This research offers insight into the contribution of altitude to the formation of volatiles in TYCs.
3
An efficient consolidation of word embedding and deep learning techniques for classifying anticancer peptides: FastText+BiLSTM. PeerJ Comput Sci 2024; 10:e1831. [PMID: 38435607 PMCID: PMC10909209 DOI: 10.7717/peerj-cs.1831] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Accepted: 12/31/2023] [Indexed: 03/05/2024]
Abstract
Anticancer peptides (ACPs) are a group of peptides that exhibit antineoplastic properties. The use of ACPs in cancer prevention can be a viable substitute for conventional cancer therapeutics, as they possess a higher degree of selectivity and safety. Recent scientific advances have generated interest in peptide-based therapies, which offer the advantage of efficiently treating the intended cells without negatively impacting normal cells. However, as the number of peptide sequences continues to increase rapidly, developing a reliable and precise prediction model becomes a challenging task. In this work, our motivation is to advance an efficient model for categorizing anticancer peptides by consolidating word embedding and deep learning models. First, the Word2Vec, GloVe, FastText and One-Hot-Encoding approaches are evaluated as embedding techniques for extracting features from peptide sequences. Then, the outputs of the embedding models are fed into the deep learning approaches CNN, LSTM and BiLSTM. To demonstrate the contribution of the proposed framework, extensive experiments are carried out on two datasets widely used in the literature, ACPs250 and Independent. Experimental results show that the proposed model enhances classification accuracy compared with state-of-the-art studies. The proposed combination, FastText+BiLSTM, achieves 92.50% accuracy on the ACPs250 dataset and 96.15% accuracy on the Independent dataset, thereby setting a new state of the art.
4
A low-cost machine learning framework for predicting drug-drug interactions based on fusion of multiple features and a parameter self-tuning strategy. Phys Chem Chem Phys 2024; 26:6300-6315. [PMID: 38305788 DOI: 10.1039/d4cp00039k] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2024]
Abstract
Poly-drug therapy is now recognized as a crucial treatment, and the analysis of drug-drug interactions (DDIs) offers substantial theoretical support and guidance for its implementation. Predicting potential DDIs using intelligent algorithms is an emerging approach in pharmacological research. However, the existing supervised models and deep learning-based techniques still have several limitations. This paper proposes a novel DDI analysis and prediction framework called the Multi-View Semi-supervised Graph-based (MVSG) framework, which provides a comprehensive judgment by integrating multiple DDI features and functions without any time-consuming training process. Unlike conventional approaches, MVSG can search for the most suitable similarity (or distance) measurement among DDI data and construct graph structures for each feature. By employing a parameter self-tuning strategy, MVSG fuses multiple graphs according to the contributions of features' information. The actual anticancer drug data are extracted from the authoritative public database for evaluating the effectiveness of our framework, including 904 drugs, 7730 DDI records and 19 types of drug interactions. Validation results indicate that the prediction is more accurate when multiple features are adopted by our framework. In comparison to conventional machine learning techniques, MVSG can achieve higher performance even with less labeled data and without a training process. Finally, MVSG is employed to narrow down the search for potential valuable combinations.
5
Whole-genome resource sequences of 57 indigenous Ethiopian goats. Sci Data 2024; 11:139. [PMID: 38287052 PMCID: PMC10825132 DOI: 10.1038/s41597-024-02973-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2023] [Accepted: 01/16/2024] [Indexed: 01/31/2024]
Abstract
Domestic goats are distributed worldwide, with approximately 35% of the one billion world goat population occurring in Africa. Ethiopia has 52.5 million goats, ~99.9% of which are considered indigenous landraces deriving from animals introduced to the Horn of Africa in the distant past by nomadic herders. They have continued to be managed by smallholder farmers and semi-mobile pastoralists throughout the region. We report here 57 goat genomes from 12 Ethiopian goat populations sampled from different agro-climates. The data were generated by sequencing DNA samples on the Illumina NovaSeq 6000 platform at a mean depth of 9.71x with 150 bp paired-end reads. In total, ~2 terabytes of raw data were generated, and 99.8% of the clean reads mapped successfully against the goat reference genome assembly at a coverage of 99.6%. About 24.76 million SNPs were generated. These SNPs can be used to study the population structure and genome dynamics of goats at the country, regional, and global levels to shed light on the species' evolutionary trajectory.
6
A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset. BMC Biol 2024; 22:13. [PMID: 38273258 PMCID: PMC10809545 DOI: 10.1186/s12915-024-01820-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2023] [Accepted: 01/09/2024] [Indexed: 01/27/2024]
Abstract
BACKGROUND Single-nucleotide polymorphisms (SNPs) are the most widely used form of molecular genetic variation studies. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The genome analysis toolkit (GATK) is one of the most widely used SNP calling software tools publicly available, but unfortunately, high-performance computing versions of this tool have yet to become widely available and affordable. RESULTS Here we report an open-source high-performance computing genome variant calling workflow (HPC-GVCW) for GATK that can run on multiple computing platforms from supercomputers to desktop machines. We benchmarked HPC-GVCW on multiple crop species for performance and accuracy with comparable results with previously published reports (using GATK alone). Finally, we used HPC-GVCW in production mode to call SNPs on a "subpopulation aware" 16-genome rice reference panel with ~ 3000 resequenced rice accessions. The entire process took ~ 16 weeks and resulted in the identification of an average of 27.3 M SNPs/genome and the discovery of ~ 2.3 million novel SNPs that were not present in the flagship reference genome for rice (i.e., IRGSP RefSeq). CONCLUSIONS This study developed an open-source pipeline (HPC-GVCW) to run GATK on HPC platforms, which significantly improved the speed at which SNPs can be called. The workflow is widely applicable as demonstrated successfully for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Using HPC-GVCW in production mode to call SNPs on a 25 multi-crop-reference genome data set produced over 1.1 billion SNPs that were publicly released for functional and breeding studies. For rice, many novel SNPs were identified and were found to reside within genes and open chromatin regions that are predicted to have functional consequences. Combined, our results demonstrate the usefulness of combining a high-performance SNP calling architecture solution with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment.
7
Chromosome-Level Assembly of Artemia franciscana Sheds Light on Sex Chromosome Differentiation. Genome Biol Evol 2024; 16:evae006. [PMID: 38245839 PMCID: PMC10827361 DOI: 10.1093/gbe/evae006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 11/27/2023] [Accepted: 12/21/2023] [Indexed: 01/22/2024]
Abstract
Since the commercialization of brine shrimp (genus Artemia) in the 1950s, this lineage, and in particular the model species Artemia franciscana, has been the subject of extensive research. However, our understanding of the genetic mechanisms underlying various aspects of their reproductive biology, including sex determination, is still lacking. This is partly due to the scarcity of genomic resources for Artemia species and crustaceans in general. Here, we present a chromosome-level genome assembly of A. franciscana (Kellogg 1906), from the Great Salt Lake, United States. The genome is 1 GB, and the majority of the genome (81%) is scaffolded into 21 linkage groups using a previously published high-density linkage map. We performed coverage and FST analyses using male and female genomic and transcriptomic reads to quantify the extent of differentiation between the Z and W chromosomes. Additionally, we quantified the expression levels in male and female heads and gonads and found further evidence for dosage compensation in this species.
8
An ecological transcriptome approach to capture the molecular and physiological mechanisms of mass flowering in Shorea curtisii. PeerJ 2023; 11:e16368. [PMID: 38047035 PMCID: PMC10693236 DOI: 10.7717/peerj.16368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Accepted: 10/08/2023] [Indexed: 12/05/2023]
Abstract
Climatic factors have commonly been attributed as the trigger of general flowering, a unique community-level mass flowering phenomenon involving most dipterocarp species that forms the foundation of Southeast Asian tropical rainforests. This intriguing flowering event is often succeeded by mast fruiting, which provides a temporary yet substantial burst of food resources for animals, particularly frugivores. However, the physiological mechanism that triggers general flowering, particularly in dipterocarp species, is not well understood largely due to its irregular and unpredictable occurrences in the tall and dense forests. To shed light on this mechanism, we employed ecological transcriptomic analyses on an RNA-seq dataset of a general flowering species, Shorea curtisii (Dipterocarpaceae), sequenced from leaves and buds collected at multiple vegetative and flowering phenological stages. We assembled 64,219 unigenes from the transcriptome of which 1,730 and 3,559 were differentially expressed in the leaf and the bud, respectively. Differentially expressed unigene clusters were found to be enriched with homologs of Arabidopsis thaliana genes associated with response to biotic and abiotic stresses, nutrient level, and hormonal treatments. When combined with rainfall data, our transcriptome data reveals that the trees were responding to a brief period of drought prior to the elevated expression of key floral promoters and followed by differential expression of unigenes that indicates physiological changes associated with the transition from vegetative to reproductive stages. Our study is timely for a representative general flowering dipterocarp species that occurs in forests that are under the constant threat of deforestation and climate change as it pinpoints important climate sensitive and flowering-related homologs and offers a glimpse into the cascade of gene expression before and after the onset of floral initiation.
9
Efficient end-to-end long-read sequence mapping using minimap2-fpga integrated with hardware accelerated chaining. Sci Rep 2023; 13:20174. [PMID: 37978244 PMCID: PMC10656460 DOI: 10.1038/s41598-023-47354-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Accepted: 11/12/2023] [Indexed: 11/19/2023]
Abstract
minimap2 is the gold-standard software for reference-based sequence mapping in third-generation long-read sequencing. While minimap2 is relatively fast, further speedup is desirable, especially when processing a multitude of large datasets. In this work, we present minimap2-fpga, a hardware-accelerated version of minimap2 that speeds up the mapping process by integrating an FPGA kernel optimised for chaining. Integrating the FPGA kernel into minimap2 posed significant challenges that we solved by accurately predicting the processing time on hardware while considering data transfer overheads, mitigating hardware scheduling overheads in a multi-threaded environment, and optimizing memory management for processing large realistic datasets. We demonstrate speed-ups in end-to-end run-time for data from both Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio). minimap2-fpga is up to 79% and 53% faster than minimap2 for [Formula: see text] ONT and [Formula: see text] PacBio datasets respectively, when mapping without base-level alignment. When mapping with base-level alignment, minimap2-fpga is up to 62% and 10% faster than minimap2 for [Formula: see text] ONT and [Formula: see text] PacBio datasets respectively. The accuracy is near-identical to that of original minimap2 for both ONT and PacBio data, when mapping both with and without base-level alignment. minimap2-fpga is supported on Intel FPGA-based systems (evaluations performed on an on-premise system) and Xilinx FPGA-based systems (evaluations performed on a cloud system). We also provide a well-documented library for the FPGA-accelerated chaining kernel to be used by future researchers developing sequence alignment software with limited hardware background.
10
Alignment-based Protein Mutational Landscape Prediction: Doing More with Less. Genome Biol Evol 2023; 15:evad201. [PMID: 37936309 PMCID: PMC10653582 DOI: 10.1093/gbe/evad201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2023] [Revised: 10/27/2023] [Accepted: 11/01/2023] [Indexed: 11/09/2023]
Abstract
The wealth of genomic data has boosted the development of computational methods predicting the phenotypic outcomes of missense variants. The most accurate ones exploit multiple sequence alignments, which can be costly to generate. Recent efforts for democratizing protein structure prediction have overcome this bottleneck by leveraging the fast homology search of MMseqs2. Here, we show the usefulness of this strategy for mutational outcome prediction through a large-scale assessment of 1.5M missense variants across 72 protein families. Our study demonstrates the feasibility of producing alignment-based mutational landscape predictions that are both high-quality and compute-efficient for entire proteomes. We provide the community with the whole human proteome mutational landscape and simplified access to our predictive pipeline.
11
Fibration symmetry uncovers minimal regulatory networks for logical computation in bacteria. arXiv 2023; arXiv:2310.10895v1. [PMID: 37904746 PMCID: PMC10614959] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
Symmetry principles have proven important in physics, deep learning and geometry, allowing for the reduction of complicated systems to simpler, more comprehensible models that preserve the system's features of interest. Biological systems often show a high level of complexity and consist of a high number of interacting parts. Using symmetry fibrations, the relevant symmetries for biological 'message-passing' networks, we reduced the gene regulatory networks of E. coli and B. subtilis bacteria in a way that preserves information flow and highlights the computational capabilities of the network. Nodes that share isomorphic input trees are grouped into equivalence classes called fibers, whereby genes that receive signals with the same 'history' belong to one fiber and synchronize. We further reduce the networks to their computational core by removing "dangling ends" via k-core decomposition. The computational core of the network consists of a few strongly connected components in which signals can cycle, while signals are transmitted between these "information vortices" in a linear feed-forward manner. These components are in charge of decision making in the bacterial cell by employing a series of genetic toggle-switch circuits that store memory, and oscillator circuits. These circuits act as the central computation machine of the network, whose output signals then spread to the rest of the network.
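The k-core step mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes an undirected graph stored as a dictionary of neighbour sets; the function name `k_core` and the toy network are hypothetical and not taken from the paper, whose pipeline (including how edge direction is handled) may differ.

```python
def k_core(adjacency, k):
    """Return the node set of the k-core of an undirected graph.

    adjacency maps each node to the set of its neighbours.
    """
    # Work on a copy so the caller's graph is left untouched.
    graph = {node: set(neigh) for node, neigh in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for node in list(graph):
            if len(graph[node]) < k:
                # Peel the low-degree node and detach it from its neighbours.
                for neigh in graph.pop(node):
                    if neigh in graph:
                        graph[neigh].discard(node)
                changed = True
    return set(graph)


# Toy network (hypothetical): a triangle a-b-c with a dangling chain c-d-e.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
core = k_core(toy, 2)
```

Repeatedly peeling nodes with fewer than k neighbours removes the chain `d`, `e` and leaves only the cycle, which is the sense in which "dangling ends" are stripped from a network.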
12
A generalizable Cas9/sgRNA prediction model using machine transfer learning with small high-quality datasets. Nat Commun 2023; 14:5514. [PMID: 37679324 PMCID: PMC10485023 DOI: 10.1038/s41467-023-41143-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/22/2023] [Accepted: 08/24/2023] [Indexed: 09/09/2023]
Abstract
The CRISPR/Cas9 nuclease from Streptococcus pyogenes (SpCas9) can be used with single guide RNAs (sgRNAs) as a sequence-specific antimicrobial agent and as a genome-engineering tool. However, current bacterial sgRNA activity models struggle with accurate predictions and do not generalize well, possibly because the underlying datasets used to train the models do not accurately measure SpCas9/sgRNA activity and cannot distinguish on-target cleavage from toxicity. Here, we solve this problem by using a two-plasmid positive selection system to generate high-quality data that more accurately reports on SpCas9/sgRNA cleavage and that separates activity from toxicity. We develop a machine learning architecture (crisprHAL) that can be trained on existing datasets, that shows marked improvements in sgRNA activity prediction accuracy when transfer learning is used with small amounts of high-quality data, and that can generalize predictions to different bacteria. The crisprHAL model recapitulates known SpCas9/sgRNA-target DNA interactions and provides a pathway to a generalizable sgRNA bacterial activity prediction tool that will enable accurate antimicrobial and genome engineering applications.
13
A reliable QSPR model for predicting drug release rate from metal-organic frameworks: a simple and robust drug delivery approach. RSC Adv 2023; 13:24617-24627. [PMID: 37601598 PMCID: PMC10432896 DOI: 10.1039/d3ra00070b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2023] [Accepted: 06/05/2023] [Indexed: 08/22/2023]
Abstract
During the drug release process, the drug is transferred from the starting point in the drug delivery system to the surface, and then to the release medium. Metal-organic frameworks (MOFs) potentially have unique features to be utilized as promising carriers for drug delivery, due to their suitable pore size, high surface area, and structural flexibility. The loading and release of various therapeutic drugs through the MOFs are effectively accomplished due to their tunable inorganic clusters and organic ligands. Since the drug release rate percentage (RES%) is a significant concern, a quantitative structure-property relationship (QSPR) method was applied to achieve an accurate model predicting the drug release rate from MOFs. Structure-based descriptors, including the number of nitrogen and oxygen atoms, along with two other adjusted descriptors, were applied for obtaining the best multilinear regression (BMLR) model. Drug release rates from 67 MOFs were applied to provide a precise model. The coefficients of determination (R2) for the training and test sets obtained were both 0.9999. The root mean square error for prediction (RMSEP) of the RES% values for the training and test sets were 0.006 and 0.005, respectively. To examine the precision of the model, external validation was performed through a set of new observations, which demonstrated that the model works to a satisfactory degree.
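For readers unfamiliar with the two fit statistics quoted in this abstract, the sketch below computes a coefficient of determination and a root mean square error in plain Python. The definitions used here are the common textbook ones and the numbers are made-up toy values; the paper may define R2 or RMSEP slightly differently, and none of the values below come from it.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical release-rate percentages, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [11.0, 19.0, 31.0, 39.0]

r2_value = r_squared(observed, predicted)
rmse_value = rmse(observed, predicted)
```

Reporting both statistics separately for the training and test sets, as the abstract does, is what allows a reader to judge whether a model generalizes beyond the data it was fitted on.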
14
BMX: Biological modelling and interface exchange. Sci Rep 2023; 13:12235. [PMID: 37507417 PMCID: PMC10382537 DOI: 10.1038/s41598-023-39150-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/16/2023] [Accepted: 07/20/2023] [Indexed: 07/30/2023]
Abstract
High performance computing has great potential to provide a range of significant benefits for investigating biological systems. These systems often present large modelling problems with many coupled subsystems, such as when studying colonies of bacterial cells. The aim to understand cell colonies has generated substantial interest, as they can have strong economic and societal impacts through their roles in industrial bioreactors and in complex community structures, called biofilms, found in clinical settings. Investigating these communities through realistic models can rapidly exceed the capabilities of current serial software. Here, we introduce BMX, a software system developed for the high performance modelling of large cell communities by utilising GPU acceleration. BMX builds upon the AMReX adaptive mesh refinement package to efficiently model cell colony formation under realistic laboratory conditions. Using simple test scenarios with varying nutrient availability, we show that BMX is capable of correctly reproducing observed behavior of bacterial colonies on realistic time scales, demonstrating a potential application of high performance computing to colony modelling. The open source software is available from the Zenodo repository https://doi.org/10.5281/zenodo.8084270 under the BSD-2-Clause licence.
15
smAMPsTK: a toolkit to unravel the smORFome encoding AMPs of plant species. J Biomol Struct Dyn 2023:1-13. [PMID: 37464885 DOI: 10.1080/07391102.2023.2235605] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2023] [Accepted: 07/06/2023] [Indexed: 07/20/2023]
Abstract
The pervasive repertoire of plant molecules with the potential to serve as a substitute for conventional antibiotics has led to better insights into plant-derived antimicrobial peptides (AMPs). The massive distribution of Small Open Reading Frames (smORFs) throughout eukaryotic genomes, with proven extensive biological functions, reflects their practicality as antimicrobials. Here, we have developed a pipeline named smAMPsTK to unveil the hidden smORFs encoding AMPs in plant species. By applying this pipeline to publicly available transcriptome data of five different angiosperms, we have identified AMPs of various functional activities with lengths ranging from 5 to 100 aa. We then studied the coding potential of AMPs-smORFs, the inclusion of diverse translation initiation start codons, and amino acid frequency. The codon usage study indicates no codon usage bias for smORFs encoding AMPs. Three start codons are predominant in generating AMPs. The evolutionary and conservation study revealed the widespread distribution of AMP-encoding genes throughout the plant kingdom. Domain analysis revealed that nearly all AMPs have chitin-binding ability, establishing their role as antifungal agents. The current study includes a methodology to characterize smORFs encoding AMPs and their implications as antimicrobial, antibacterial, antifungal, or antiviral agents, based on the SVM score and prediction status calculated by machine learning-based prediction models. The pipeline, complete package, and the results derived for five angiosperms are freely available at https://github.com/skbinfo/smAMPsTK. Communicated by Ramaswamy H. Sarma.
16
A large expert-curated cryo-EM image dataset for machine learning protein particle picking. Sci Data 2023; 10:392. [PMID: 37349345 PMCID: PMC10287764 DOI: 10.1038/s41597-023-02280-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 03/01/2023] [Accepted: 05/30/2023] [Indexed: 06/24/2023]
Abstract
Cryo-electron microscopy (cryo-EM) is a powerful technique for determining the structures of biological macromolecular complexes. Picking single-protein particles from cryo-EM micrographs is a crucial step in reconstructing protein structures. However, the widely used template-based particle picking process is labor-intensive and time-consuming. Though machine learning and artificial intelligence (AI) based particle picking can potentially automate the process, its development is hindered by lack of large, high-quality labelled training data. To address this bottleneck, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for protein particle picking and analysis. It consists of labelled cryo-EM micrographs (images) of 34 representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). The dataset is 2.6 terabytes and includes 9,893 high-resolution micrographs with labelled protein particle coordinates. The labelling process was rigorously validated through 2D particle class validation and 3D density map validation with the gold standard. The dataset is expected to greatly facilitate the development of both AI and classical methods for automated cryo-EM protein particle picking.
17
Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis. Protein J 2023; 42:135-146. [PMID: 36977849 DOI: 10.1007/s10930-023-10096-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Accepted: 02/13/2023] [Indexed: 03/29/2023]
Abstract
The advent of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the 'language of life', have been analyzed for a multitude of applications and inferences. Owing to the rapid development of deep learning, in recent years there have been a number of breakthroughs in the domain of Natural Language Processing. Since these methods are capable of performing different tasks when trained with a sufficient amount of data, off-the-shelf models are used to perform various biological applications. In this study, we investigated the applicability of the popular Skip-gram model for protein sequence analysis and made an attempt to incorporate some biological insights into it. We propose a novel k-mer embedding scheme, Align-gram, which is capable of mapping similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram aid the modeling and training of deep learning models better. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram in performing different types of deep learning applications for protein sequence analysis.
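As background for the Skip-gram setting this abstract builds on, the following Python sketch shows how a protein sequence can be split into overlapping k-mers and turned into (target, context) training pairs. This is the generic Skip-gram input preparation, not the Align-gram scheme itself; the function names, the example sequence and the window size are illustrative assumptions.

```python
def kmers(sequence, k):
    """Split a sequence into its overlapping k-mers (stride 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens, window):
    """Build (target, context) pairs within a symmetric context window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical six-residue peptide, for illustration only.
tokens = kmers("MKTAYI", 3)
pairs = skipgram_pairs(tokens, 1)
```

A Skip-gram model is then trained to predict each context k-mer from its target, so k-mers that occur in similar neighbourhoods end up with similar vectors.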
18
Classification of Alzheimer's disease using robust TabNet neural networks on genetic data. Math Biosci Eng 2023; 20:8358-8374. [PMID: 37161202 DOI: 10.3934/mbe.2023366] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
Alzheimer's disease (AD) is one of the most common neurodegenerative diseases and its onset is significantly associated with genetic factors. Owing to its high specificity and accuracy, genetic testing is considered an important technique for AD diagnosis. In this paper, we present an improved deep learning (DL) algorithm, namely differential genes screening TabNet (DGS-TabNet), for AD binary and multi-class classification. For performance evaluation, the proposed approach was compared with three recent DL models, multi-layer perceptron (MLP), neural oblivious decision ensembles (NODE) and TabNet, as well as five classical machine learning (ML) methods, decision tree (DT), random forests (RF), gradient boosting decision tree (GBDT), light gradient boosting machine (LGBM) and support vector machine (SVM), on public data sets from the Gene Expression Omnibus (GEO). Moreover, the biological interpretability of the globally important genetic features used for AD classification was examined with the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO). The results demonstrated that the proposed DGS-TabNet achieved the best performance, with an accuracy of 93.80% for binary classification and 88.27% for multi-class classification. Meanwhile, the gene pathway analyses showed that AVIL and NDUFS4 were the two most important global genetic features and that the 22 feature genes obtained were partially correlated with AD pathogenesis. It was concluded that the proposed DGS-TabNet could be used to detect AD-susceptible genes, and the biological interpretability of these genes also indicates their potential as AD biomarkers.
|
19
|
A prefix and attention map discrimination fusion guided attention for biomedical named entity recognition. BMC Bioinformatics 2023; 24:42. [PMID: 36755230 PMCID: PMC9907889 DOI: 10.1186/s12859-023-05172-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2022] [Accepted: 02/03/2023] [Indexed: 02/10/2023] Open
Abstract
BACKGROUND The biomedical literature is growing rapidly, and it is increasingly important to extract meaningful information from the vast amount of literature. Biomedical named entity recognition (BioNER) is one of the key and fundamental tasks in biomedical text mining. It also acts as a primitive step for many downstream applications such as relation extraction and knowledge base completion. Therefore, the accurate identification of entities in biomedical literature has certain research value. However, this task is challenging due to the insufficiency of sequence labeling and the lack of large-scale labeled training data and domain knowledge. RESULTS In this paper, we use a novel word-pair classification method, design a simple attention mechanism and propose a novel architecture to solve the research difficulties of BioNER more efficiently without leveraging any external knowledge. Specifically, we break down the limitations of sequence labeling-based approaches by predicting the relationship between word pairs. Based on this, we enhance the pre-trained model BioBERT through the proposed prefix and attention map discrimination fusion guided attention and propose E-BioBERT. Our proposed attention differentiates the distribution of different heads in different layers in BioBERT, which enriches the diversity of self-attention. Our model is superior to the state-of-the-art models compared on five available datasets: BC4CHEMD, BC2GM, BC5CDR-Disease, BC5CDR-Chem, and NCBI-Disease, achieving F1-scores of 92.55%, 85.45%, 87.53%, 94.16% and 90.55%, respectively. CONCLUSION Compared with many previous models, our method does not require additional training datasets, external knowledge, or a complex training process. The experimental results on five BioNER benchmark datasets demonstrate that our model is better at mining semantic information, alleviates the problem of label inconsistency, and has higher entity recognition ability.
More importantly, we analyze and demonstrate the effectiveness of our proposed attention.
|
20
|
A novel automated robust dual-channel EEG-based sleep scoring system using optimal half-band pair linear-phase biorthogonal wavelet filter bank. Appl Intell 2023; 53:1-19. [PMID: 36777881 PMCID: PMC9906594 DOI: 10.1007/s10489-022-04432-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Accepted: 12/21/2022] [Indexed: 02/11/2023]
Abstract
Nowadays, the hectic work life of people has led to sleep deprivation. This may further result in sleep-related disorders and adverse physiological conditions. Therefore, sleep study has become an active research area. Sleep scoring is crucial for detecting sleep-related disorders like sleep apnea, insomnia, narcolepsy, periodic leg movement (PLM), and restless leg syndrome (RLS). Sleep is conventionally monitored in a sleep laboratory using polysomnography (PSG) which is the recording of various physiological signals. The traditional sleep stage scoring (SSG) done by professional sleep scorers is a tedious, strenuous, and time-consuming process as it is manual. Hence, developing a machine-learning model for automatic SSG is essential. In this study, we propose an automated SSG approach based on the biorthogonal wavelet filter bank's (BWFB) novel least squares (LS) design. We have utilized a huge Wisconsin sleep cohort (WSC) database in this study. The proposed study is a pioneering work on automatic sleep stage classification using the WSC database, which includes good sleepers and patients suffering from various sleep-related disorders, including apnea, insomnia, hypertension, diabetes, and asthma. To investigate the generalization of the proposed system, we evaluated the proposed model with the following publicly available databases: cyclic alternating pattern (CAP), sleep EDF, ISRUC, MIT-BIH, and the sleep apnea database from St. Vincent's University. This study uses only two unipolar EEG channels, namely O1-M2 and C3-M2, for the scoring. The Hjorth parameters (HP) are extracted from the wavelet subbands (SBS) that are obtained from the optimal BWFB. To classify sleep stages, the HP features are fed to several supervised machine learning classifiers. 12 different datasets have been created to develop a robust model. A total of 12 classification tasks (CT) have been conducted employing various classification algorithms. 
Our developed model achieved the best accuracy of 83.2% and Cohen's Kappa of 0.7345 in reliably distinguishing five sleep stages, using an ensemble bagged tree classifier with 10-fold cross-validation on WSC data. We also observed that our system is either better than or competitive with existing state-of-the-art systems when tested with the above-mentioned five databases other than WSC. This method yielded promising results using only two EEG channels on the large WSC database. Our approach is simple; hence, the developed model can be installed in home-based clinical systems and wearable devices for sleep scoring.
|
21
|
Deciphering sperm functions using biological networks. Biotechnol Genet Eng Rev 2023:1-25. [PMID: 36722689 DOI: 10.1080/02648725.2023.2168912] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/28/2022] [Indexed: 02/02/2023]
Abstract
The global human population is exponentially increasing, which requires the production of quality food through efficient reproduction as well as sustainable production of livestock. Lack of knowledge and technology for assessing semen quality and predicting bull fertility is hindering advances in animal science and food animal production and causing millions of dollars of economic losses annually. The intent of this systemic review is to summarize methods from computational biology for analysis of gene, metabolite, and protein networks to identify potential markers that can be applied to improve livestock reproduction, with a focus on bull fertility. We provide examples of available gene, metabolic, and protein networks and computational biology methods to show how the interactions between genes, proteins, and metabolites together drive the complex process of spermatogenesis and regulate fertility in animals. We demonstrate the use of the National Center for Biotechnology Information (NCBI) and Ensembl for finding gene sequences, and then use them to create and understand gene, protein and metabolite networks for sperm associated factors to elucidate global cellular processes in sperm. This study highlights the value of mapping complex biological pathways among livestock and potential for conducting studies on promoting livestock improvement for global food security.
|
22
|
Systematic and benchmarking studies of pipelines for mammal WGBS data in the novel NGS platform. BMC Bioinformatics 2023; 24:33. [PMID: 36721080 PMCID: PMC9890740 DOI: 10.1186/s12859-023-05163-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2022] [Accepted: 01/27/2023] [Indexed: 02/01/2023] Open
Abstract
BACKGROUND Whole genome bisulfite sequencing (WGBS) can dissect methylation status at the nucleotide-level resolution of 5-methylcytosine (5-mC) on a genome-wide scale. It is a powerful technique for studying the epigenome in various cell types and tissues. As a recently established next-generation sequencing (NGS) platform, GenoLab M is a promising alternative platform. However, its comprehensive evaluation for WGBS has not been reported. In this research, we sequenced two bisulfite-converted mammalian DNA samples using our GenoLab M and NovaSeq 6000, respectively. Then, we systematically compared those data via four widely used WGBS tools (BSMAP, Bismark, BatMeth2, BS-Seeker2) and a new bisulfite-seq tool (BSBolt). We examined their computational time, genome depth and coverage, and evaluated their percentage of methylated Cs. RESULT Benchmarking a combination of pre- and post-processing methods, we found that trimming improved mapping efficiency in eight datasets. The data from the two platforms uncovered ~80% of CpG sites genome-wide in the human cell line. The data sequenced by GenoLab M had a far lower proportion of duplicates (~5.5%). Among pipelines, BSMAP provided an intriguing representation of 5-mC distribution at CpG sites with 5-mC levels > ~78% in datasets from human cell lines, especially in the GenoLab M data. BSMAP showed advantages in running time, percentage of uniquely mapped reads, genomic coverage, and quantitative accuracy. Finally, by comparison with the previous methylation patterns of human cell line and mouse tissue, we confirmed that the data from GenoLab M showed consistency and accuracy in methylation levels of CpG sites similar to those from NovaSeq 6000. CONCLUSION Together, we confirmed that GenoLab M is a qualified NGS platform for WGBS with high performance. Our results showed that BSMAP is a suitable pipeline for WGBS studies on the GenoLab M platform.
|
23
|
DeepMPF: deep learning framework for predicting drug-target interactions based on multi-modal representation with meta-path semantic analysis. J Transl Med 2023; 21:48. [PMID: 36698208 PMCID: PMC9876420 DOI: 10.1186/s12967-023-03876-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 08/16/2022] [Accepted: 01/05/2023] [Indexed: 01/26/2023] Open
Abstract
BACKGROUND Drug-target interaction (DTI) prediction has become a crucial prerequisite in drug design and drug discovery. However, traditional biological experiments are time-consuming and expensive, as there are abundant complex interactions in the large genomic and chemical spaces. To alleviate this problem, many computational methods have been developed to complement biological experiments and narrow the search space to a preferred candidate domain. However, most previous approaches cannot fully consider the semantic information of association behavior based on several schemas to represent the complex structure of heterogeneous biological networks. Additionally, DTI prediction based on single modalities cannot satisfy the demand for prediction accuracy. METHODS We propose a multi-modal representation framework, 'DeepMPF', based on meta-path semantic analysis, which effectively utilizes heterogeneous information to predict DTI. Specifically, we first construct protein-drug-disease heterogeneous networks composed of three entities. Then the feature information is obtained under three views, comprising sequence modality, heterogeneous structure modality and similarity modality. We propose six representative meta-path schemas to preserve the high-order nonlinear structure and capture hidden structural information of the heterogeneous network. Finally, DeepMPF generates highly representative comprehensive feature descriptors and calculates the probability of interaction through joint learning. RESULTS To evaluate the predictive performance of DeepMPF, comparison experiments are conducted on four gold datasets. Our method obtains competitive performance on all datasets. We also explore the influence of different feature embedding dimensions, learning strategies and classification methods.
Notably, the drug repositioning experiments on COVID-19 and HIV demonstrate that DeepMPF can be applied to real-world problems and can help drug discovery. Further analysis by molecular docking experiments enhances the credibility of the drug candidates predicted by DeepMPF. CONCLUSIONS All the results demonstrate the effective predictive capability of DeepMPF for drug-target interactions. It can be utilized as a useful tool to prescreen the most promising drug candidates for a protein. The web server of the DeepMPF predictor is freely available at http://120.77.11.78/DeepMPF/, which can help relevant researchers in further study.
|
24
|
Interpreting the molecular mechanisms of disease variants in human transmembrane proteins. Biophys J 2023:S0006-3495(22)03941-8. [PMID: 36600598 DOI: 10.1016/j.bpj.2022.12.031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2022] [Revised: 11/19/2022] [Accepted: 12/21/2022] [Indexed: 01/06/2023] Open
Abstract
Next-generation sequencing of human genomes reveals millions of missense variants, some of which may lead to loss of protein function and ultimately disease. Here, we investigate missense variants in membrane proteins, which are key drivers in cell signaling and recognition. We find enrichment of pathogenic variants in the transmembrane region across 19,000 functionally classified variants in human membrane proteins. To accurately predict variant consequences, one fundamentally needs to understand the underlying molecular processes. A key mechanism underlying pathogenicity in missense variants of soluble proteins has been shown to be loss of stability. Membrane proteins, however, are widely understudied. Here, we interpret variant effects on a larger scale by performing structure-based estimations of changes in thermodynamic stability using a membrane-specific energy function and analyses of sequence conservation during evolution of 15 transmembrane proteins. We find evidence for loss of stability being the cause of pathogenicity in more than half of the pathogenic variants, indicating that this is a driving factor also in membrane-protein-associated diseases. Our findings show how computational tools aid in gaining mechanistic insights into variant consequences for membrane proteins. To enable broader analyses of disease-related and population variants, we include variant mappings for the entire human proteome.
|
25
|
Target Metabolome and Transcriptome Analysis Reveal Molecular Mechanism Associated with Changes of Tea Quality at Different Development Stages. Mol Biotechnol 2023; 65:52-60. [PMID: 35780278 DOI: 10.1007/s12033-022-00525-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2022] [Accepted: 06/14/2022] [Indexed: 01/11/2023]
Abstract
This study aimed to explore the molecular mechanisms underlying the differential quality of tea made from leaves at different development stages. Fresh Camellia sinensis (L.) O. Kuntze "Sichuan Colonial" leaves of various development stages, from buds to old leaves, were subjected to transcriptome sequencing and metabolome analysis, and the DESeq package was used for differential expression analysis, followed by functional enrichment analyses and protein interaction analysis. Target metabolome analysis indicated that the contents of most compounds, including theobromine and epicatechin gallate, were lowest in old leaves, and transcriptome analysis revealed that DEGs were significantly involved in extracellular regions and phenylpropanoid biosynthesis, photosynthesis-related pathways, and the oleuropein steroid biosynthesis pathway. Protein-protein interaction analysis identified LOC114256852 as a hub gene. Caffeine, theobromine, L-theanine, and catechins were the main metabolites of the tea leaves, and the contents of all four main metabolites were the lowest in old leaves. Phenylpropanoid biosynthesis, photosynthesis, and brassinosteroid biosynthesis may be important targets for breeding efforts to improve tea quality.
|
26
|
Protein secondary structure prediction based on Wasserstein generative adversarial networks and temporal convolutional networks with convolutional block attention modules. Math Biosci Eng 2023; 20:2203-2218. [PMID: 36899529 DOI: 10.3934/mbe.2023102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
As an important task in bioinformatics, protein secondary structure prediction (PSSP) not only benefits protein function research and tertiary structure prediction but also promotes the design and development of new drugs. However, current PSSP methods cannot sufficiently extract effective features. In this study, we propose a novel deep learning model, WGACSTCN, which combines a Wasserstein generative adversarial network with gradient penalty (WGAN-GP), a convolutional block attention module (CBAM) and a temporal convolutional network (TCN) for 3-state and 8-state PSSP. In the proposed model, the mutual game between the generator and discriminator in the WGAN-GP module can effectively extract protein features, our CBAM-TCN local extraction module can capture key deep local interactions in protein sequences segmented by a sliding window technique, and the CBAM-TCN long-range extraction module can further capture the key deep long-range interactions in sequences. We evaluate the performance of the proposed model on seven benchmark datasets. Experimental results show that our model exhibits better prediction performance than four state-of-the-art models. The proposed model has strong feature extraction ability and can extract important information more comprehensively.
|
27
|
The curse and blessing of abundance-the evolution of drug interaction databases and their impact on drug network analysis. Gigascience 2022; 12:giad011. [PMID: 36892110 PMCID: PMC10023830 DOI: 10.1093/gigascience/giad011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/07/2022] [Revised: 11/18/2022] [Accepted: 02/07/2023] [Indexed: 03/10/2023] Open
Abstract
BACKGROUND Widespread bioinformatics applications such as drug repositioning or drug-drug interaction prediction rely on the recent advances in machine learning, complex network science, and comprehensive drug datasets comprising the latest research results in molecular biology, biochemistry, or pharmacology. The problem is that there is much uncertainty in these drug datasets: we know the drug-drug or drug-target interactions reported in the research papers, but we cannot know if the unreported interactions are absent or yet to be discovered. This uncertainty hampers the accuracy of such bioinformatics applications. RESULTS We use complex network statistics tools and simulations of randomly inserted, previously unaccounted interactions in drug-drug and drug-target interaction networks (built with data from DrugBank versions released over the past decade) to investigate whether the abundance of new research data (included in the latest dataset versions) mitigates the uncertainty issue. Our results show that the drug-drug interaction networks built with the latest dataset versions become very dense and, therefore, almost impossible to analyze with conventional complex network methods. On the other hand, for the latest drug database versions, drug-target networks still include much uncertainty; however, the robustness of complex network analysis methods slightly improves. CONCLUSIONS Our big data analysis results pinpoint future research directions to improve the quality and practicality of drug databases for bioinformatics applications: benchmarking for drug-target interaction prediction and drug-drug interaction severity standardization.
|
28
|
Multi-dimensional feature recognition model based on capsule network for ubiquitination site prediction. PeerJ 2022; 10:e14427. [PMID: 36523471 PMCID: PMC9745908 DOI: 10.7717/peerj.14427] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Accepted: 10/30/2022] [Indexed: 12/12/2022] Open
Abstract
Ubiquitination is an important post-translational modification of proteins that regulates many cellular activities. Traditional experimental methods for identification are costly and time-consuming, so many researchers have proposed computational methods for ubiquitination site prediction in recent years. However, traditional machine learning methods focus on feature engineering and are not suitable for large-scale proteomic data. In addition, deep learning methods are mostly based on convolutional neural networks and fuse multiple coding approaches to achieve classification prediction. This cannot effectively identify potential fine-grained features of the input data and has limitations in the representation of dependencies between low-level features and high-level features. A multi-dimensional feature recognition model based on a capsule network (MDCapsUbi) was proposed to predict protein ubiquitination sites. The proposed module consisting of convolution operations and channel attention was used to recognize coarse-grained features in the sequence dimension and the feature map dimension. The capsule network module consisting of capsule vectors was used to identify fine-grained features and classify ubiquitinated sites. With ten-fold cross-validation, the MDCapsUbi achieved 91.82% accuracy, 91.39% sensitivity, 92.24% specificity, 0.837 MCC, 0.918 F-Score and 0.97 AUC. Experimental results indicated that the proposed method outperformed other ubiquitination site prediction technologies.
|
29
|
Accuracy benchmark of the GeneMind GenoLab M sequencing platform for WGS and WES analysis. BMC Genomics 2022; 23:533. [PMID: 35869426 PMCID: PMC9308344 DOI: 10.1186/s12864-022-08775-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2022] [Accepted: 07/18/2022] [Indexed: 11/23/2022] Open
Abstract
Background GenoLab M is a recently developed next-generation sequencing (NGS) platform from GeneMind Biosciences. To establish the performance of GenoLab M, we present the first report to benchmark and compare the WGS and WES sequencing data of the GenoLab M sequencer with those of the NovaSeq 6000 and NextSeq 550 platforms in various types of analysis. For WGS, thirty-fold sequencing on the Illumina NovaSeq platform, processed by the GATK pipeline, is currently considered the gold standard. Thus, this dataset was generated as a benchmark reference in this study. Results GenoLab M showed an average Q20 percentage of 94.62% for base quality, while NovaSeq was slightly higher at 96.97%. However, GenoLab M outperformed NovaSeq and NextSeq in duplication rate, suggesting more usable data after deduplication. For WGS short variant calling, GenoLab M showed a significant accuracy improvement over the same-depth dataset from NovaSeq, and at 22X depth reached accuracy similar to the NovaSeq 33X dataset. For 100X WES, the F-score and precision of GenoLab M were higher than those of NovaSeq or NextSeq, especially for InDel calling. Conclusions GenoLab M is a promising NGS platform for high-performance WGS and WES applications. For WGS, 22X depth on the GenoLab M sequencing platform offers a cost-effective alternative to the current mainstream 33X depth on Illumina.
|
30
|
Deep learning and multi-omics approach to predict drug responses in cancer. BMC Bioinformatics 2022; 22:632. [PMID: 36443676 PMCID: PMC9703655 DOI: 10.1186/s12859-022-04964-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Accepted: 09/25/2022] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND Cancers are genetically heterogeneous, so anticancer drugs show varying degrees of effectiveness in patients due to their differing genetic profiles. Knowing patients' responses to numerous cancer drugs is needed for personalized cancer treatment. Using molecular profiles of cancer cell lines available from the Cancer Cell Line Encyclopedia (CCLE) and anticancer drug responses available in the Genomics of Drug Sensitivity in Cancer (GDSC), we build computational models to predict anticancer drug responses from molecular features. RESULTS We propose a novel deep neural network model that integrates multi-omics data available as gene expressions, copy number variations, gene mutations, reverse phase protein array expressions, and metabolomics expressions, in order to predict cellular responses to known anti-cancer drugs. We employ a novel graph embedding layer that incorporates interactome data as prior information for prediction. Moreover, we propose a novel attention layer that effectively combines different omics features, taking their interactions into account. The network outperformed feedforward neural networks and reported 0.90 for [Formula: see text] values for prediction of drug responses from cancer cell line data available in CCLE and GDSC. CONCLUSION The outstanding results of our experiments demonstrate that the proposed method is capable of capturing the interactions of genes and proteins, and integrating multi-omics features effectively. Furthermore, both the results of ablation studies and the investigations of the attention layer imply that gene mutation has a greater influence on the prediction of drug responses than other omics data types. Therefore, we conclude that our approach not only predicts the anti-cancer drug response precisely but also provides insights into the reaction mechanisms of cancer cell lines and drugs.
|
31
|
DLBLS_SS: protein secondary structure prediction using deep learning and broad learning system. RSC Adv 2022; 12:33479-33487. [PMID: 36505696 PMCID: PMC9682407 DOI: 10.1039/d2ra06433b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Accepted: 11/16/2022] [Indexed: 11/24/2022] Open
Abstract
Protein secondary structure prediction (PSSP) is not only beneficial to the study of protein structure and function but also to the development of drugs. As a challenging task in computational biology, experimental methods for PSSP are time-consuming and expensive. In this paper, we propose a novel PSSP model DLBLS_SS based on deep learning and broad learning system (BLS) to predict 3-state and 8-state secondary structure. We first use a bidirectional long short-term memory (BLSTM) network to extract global features in residue sequences. Then, our proposed SEBTCN based on temporal convolutional networks (TCN) and channel attention can capture bidirectional key long-range dependencies in sequences. We also use BLS to rapidly optimize fused features while further capturing local interactions between residues. We conduct extensive experiments on public test sets including CASP10, CASP11, CASP12, CASP13, CASP14 and CB513 to evaluate the performance of the model. Experimental results show that our model exhibits better 3-state and 8-state PSSP performance compared to five state-of-the-art models.
|
32
|
Identification of potential microRNA diagnostic panels and uncovering regulatory mechanisms in breast cancer pathogenesis. Sci Rep 2022; 12:20135. [PMID: 36418345 PMCID: PMC9684445 DOI: 10.1038/s41598-022-24347-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2022] [Accepted: 11/14/2022] [Indexed: 11/24/2022] Open
Abstract
Early diagnosis of breast cancer (BC), the most common cancer among women, increases the survival rate and effectiveness of treatment. MicroRNAs (miRNAs) control various cell behaviors, and their dysregulation is widely involved in pathophysiological processes such as BC development and progression. In this study, we aimed to identify potential miRNA biomarkers for early diagnosis of BC. We also proposed a consensus-based strategy to analyze the miRNA expression data to gain a deeper insight into the regulatory roles of miRNAs in BC initiation. Two microarray datasets (GSE106817 and GSE113486) were analyzed to explore the differentially expressed miRNAs (DEMs) in serum of BC patients and healthy controls. Utilizing multiple bioinformatics tools, six serum-based miRNA biomarkers (miR-92a-3p, miR-23b-3p, miR-191-5p, miR-141-3p, miR-590-5p and miR-190a-5p) were identified for BC diagnosis. We applied our consensus and integration approach to construct a comprehensive BC-specific miRNA-TF co-regulatory network. Using different combinations of these miRNA biomarkers, two novel diagnostic models, consisting of miR-92a-3p, miR-23b-3p and miR-191-5p (model 1) and miR-92a-3p, miR-23b-3p, miR-141-3p and miR-590-5p (model 2), were obtained from bioinformatics analysis. Validation analysis was carried out for the considered models on two microarray datasets (GSE73002 and GSE41922). The model based on similar network topology features, comprising miR-92a-3p, miR-23b-3p and miR-191-5p, was the most promising model for distinguishing BC patients from healthy controls, with 0.89 sensitivity, 0.96 specificity and an area under the curve (AUC) of 0.98. These findings elucidate the regulatory mechanisms underlying BC and represent novel biomarkers for early BC diagnosis.
|
33
|
Sex at the interface: the origin and impact of sex differences in the developing human placenta. Biol Sex Differ 2022; 13:50. [PMID: 36114567 PMCID: PMC9482177 DOI: 10.1186/s13293-022-00459-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 05/20/2022] [Accepted: 09/02/2022] [Indexed: 11/20/2022] Open
Abstract
The fetal placenta is a source of hormones and immune factors that play a vital role in maintaining pregnancy and facilitating fetal growth. Cells in this extraembryonic compartment match the chromosomal sex of the embryo itself. Sex differences have been observed in common gestational pathologies, highlighting the importance of maternal immune tolerance to the fetal compartment. Over the past decade, several studies examining placentas from term pregnancies have revealed widespread sex differences in hormone signaling, immune signaling, and metabolic functions. Given the rapid and dynamic development of the human placenta, sex differences that exist at term (37–42 weeks gestation) are unlikely to align precisely with those present at earlier stages when the fetal–maternal interface is being formed and the foundations of a healthy or diseased pregnancy are established. While fetal sex as a variable is often left unreported in studies performing transcriptomic profiling of the first-trimester human placenta, four recent studies have specifically examined fetal sex in early human placental development. In this review, we discuss the findings from these publications and consider the evidence for the genetic, hormonal, and immune mechanisms that are theorized to account for sex differences in early human placenta. We also highlight the cellular and molecular processes that are most likely to be impacted by fetal sex and the evolutionary pressures that may have given rise to these differences. With growing recognition of the fetal origins of health and disease, it is important to shed light on sex differences in early prenatal development, as these observations may unlock insight into the foundations of sex-biased pathologies that emerge later in life. Placental sex differences exist from early prenatal development, and may help explain sex differences in pregnancy outcomes. 
Transcriptome profiling of early to mid-gestation placenta reveals that immune signaling is a hub of early prenatal sex differences. Differentially expressed genes between male and female placenta fall into the following functional associations: chromatin modification, transcription, splicing, translation, signal transduction, metabolic regulation, cell death and autophagy regulation, ubiquitination, cell adhesion and cell–cell interaction. Placental sex differences likely reflect the interaction of cell-intrinsic chromosome complement with extrinsic endocrine signals from the fetal compartment that accompany gonadal differentiation. Understanding the mechanisms behind sex differences in placental development and function will provide key insight into molecular targets that can be modulated to improve sex-biased obstetrical complications.
|
34
|
MultiHop attention for knowledge diagnosis of mathematics examination. APPL INTELL 2022. [DOI: 10.1007/s10489-022-04033-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
35
|
Anti-lung cancer drug discovery approaches by polysaccharides: an in silico study, quantum calculation and molecular dynamics study. J Biomol Struct Dyn 2022:1-17. [DOI: 10.1080/07391102.2022.2110156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/15/2022]
|
36
|
Machine learning for the micropeptide encoded by LINC02381 regulates ferroptosis through the glucose transporter SLC2A10 in glioblastoma. BMC Cancer 2022; 22:882. [PMID: 35962317 PMCID: PMC9373536 DOI: 10.1186/s12885-022-09972-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2022] [Accepted: 08/03/2022] [Indexed: 11/10/2022] Open
Abstract
Glioblastoma (GBM) is the most common primary intracranial tumor of the central nervous system, and resistance to temozolomide is an important reason for the failure of GBM treatment. Our screening showed that Solute Carrier Family 2 Member 10 (SLC2A10) is significantly overexpressed in GBM with a poor prognosis and is enriched in the NF-E2 p45-related factor 2 (NRF2) signalling pathway. The NRF2 signalling pathway is an important defence mechanism against ferroptosis. The SLC2A10-related LINC02381 is highly expressed in GBM and is localized in the cytoplasm and exosomes, while LINC02381-encoded micropeptides are localized in the exosomes. The micropeptide encoded by LINC02381 may offer a potential treatment strategy for GBM, but the mechanism underlying its function is not yet clear. We put forward the hypothesis: “The micropeptide encoded by LINC02381 regulates ferroptosis through the glucose transporter SLC2A10 in GBM.” This study applied machine learning to micropeptides to support personalized diagnosis and treatment plans for the precise treatment of GBM, thereby promoting the development of translational medicine. The study aimed to help find new diagnostic and prognostic biomarkers and to provide experimental scientists with a new strategy for designing downstream validation experiments.
|
37
|
Targeting SARS-CoV-2 endoribonuclease: a structure-based virtual screening supported by in vitro analysis. Sci Rep 2022; 12:13337. [PMID: 35922447 PMCID: PMC9349323 DOI: 10.1038/s41598-022-17573-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2022] [Accepted: 07/27/2022] [Indexed: 11/13/2022] Open
Abstract
Researchers are focused on discovering compounds that can interfere with the life cycle of SARS-CoV-2, the virus that causes COVID-19. One of the important non-structural proteins is the endoribonuclease, since it is responsible for processing viral RNA to evade detection by the host defense system. This work investigates a hierarchical structure-based virtual screening approach targeting NSP15. Different filtering approaches to predict the interactions of the compounds have been included in this study. Using a deep learning technique, we screened 823,821 compounds from five different databases (ZINC15, NCI, Drug Bank, Maybridge, and NCI Diversity set III). Subsequently, two docking protocols (extra precision and induced fit) were used to assess the binding affinity of the compounds, followed by molecular dynamics simulation supported by the MM-GBSA binding free energy. Interestingly, one compound (ZINC000104379474) from the ZINC15 database was found to have a good binding affinity of −7.68 kcal/mol. The VERO-E6 cell line was used to investigate its therapeutic effect in vitro. The half-maximal cytotoxic concentration (CC50) and the half-maximal inhibitory concentration (IC50) were determined to be 0.9 mg/ml and 0.01 mg/ml, respectively; therefore, the selectivity index is 90. In conclusion, ZINC000104379474 was shown to be a good hit for targeting the virus, although further in vivo investigation is needed before it can be considered a drug candidate.
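The selectivity index quoted in this abstract follows from the two reported concentrations. As a minimal sketch, assuming the conventional definition (half-maximal cytotoxic concentration divided by half-maximal inhibitory concentration), the arithmetic can be reproduced as follows; the function name `selectivity_index` is hypothetical and not taken from the paper.

```python
def selectivity_index(cc50: float, ic50: float) -> float:
    """Ratio of cytotoxic to inhibitory concentration (same units for both).

    Assumes the conventional definition SI = CC50 / IC50; a larger value
    means the compound inhibits the virus at doses well below toxic ones.
    """
    if cc50 < 0 or ic50 <= 0:
        raise ValueError("cc50 must be non-negative and ic50 must be positive")
    return cc50 / ic50


# Values reported in the abstract, both in mg/ml.
si = selectivity_index(cc50=0.9, ic50=0.01)
print(round(si, 2))
```

Because the index is a ratio, the units cancel, so the result is the same whether both concentrations are given in mg/ml or in molar units.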
|
38
|
Learning size-adaptive molecular substructures for explainable drug-drug interaction prediction by substructure-aware graph neural network. Chem Sci 2022; 13:8693-8703. [PMID: 35974769 PMCID: PMC9337739 DOI: 10.1039/d2sc02023h] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2022] [Accepted: 07/06/2022] [Indexed: 01/03/2023] Open
Abstract
Drug-drug interactions (DDIs) can trigger unexpected pharmacological effects on the body, and the causal mechanisms are often unknown. Graph neural networks (GNNs) have been developed to better understand DDIs. However, identifying key substructures that contribute most to the DDI prediction is a challenge for GNNs. In this study, we presented a substructure-aware graph neural network, a message passing neural network equipped with a novel substructure attention mechanism and a substructure-substructure interaction module (SSIM) for DDI prediction (SA-DDI). Specifically, the substructure attention was designed to capture size- and shape-adaptive substructures based on the chemical intuition that the sizes and shapes are often irregular for functional groups in molecules. DDIs are fundamentally caused by chemical substructure interactions. Thus, the SSIM was used to model the substructure-substructure interactions by highlighting important substructures while de-emphasizing the minor ones for DDI prediction. We evaluated our approach in two real-world datasets and compared the proposed method with the state-of-the-art DDI prediction models. The SA-DDI surpassed other approaches on the two datasets. Moreover, the visual interpretation results showed that the SA-DDI was sensitive to the structure information of drugs and was able to detect the key substructures for DDIs. These advantages demonstrated that the proposed method improved the generalization and interpretation capability of DDI prediction modeling.
|
39
|
PHARP: a pig haplotype reference panel for genotype imputation. Sci Rep 2022; 12:12645. [PMID: 35879321 PMCID: PMC9314402 DOI: 10.1038/s41598-022-15851-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2021] [Accepted: 06/30/2022] [Indexed: 11/18/2022] Open
Abstract
Pigs not only function as a major meat source worldwide but also are commonly used as an animal model for studying human complex traits. A large haplotype reference panel has been used to facilitate efficient phasing and imputation of relatively sparse genome-wide microarray chips and low-coverage sequencing data. Using the imputed genotypes in downstream analyses, such as GWASs, TWASs, eQTL mapping and genomic prediction (GS), is beneficial for obtaining novel findings. However, there is still a lack of publicly available, high-quality pig reference panels with large sample sizes and high diversity, which greatly limits the application of genotype imputation in pigs. In response, we built the pig Haplotype Reference Panel (PHARP) database. PHARP provides a reference panel of 2012 pig haplotypes at 34 million SNPs constructed using whole-genome sequence data from more than 49 studies of 71 pig breeds. It also provides Web-based analytical tools that allow researchers to carry out phasing and imputation consistently and efficiently. PHARP is freely accessible at http://alphaindex.zju.edu.cn/PHARP/index.php. We demonstrate its applicability for commercial pig 50K SNP arrays by accurately imputing 2.6 billion genotypes at a concordance rate of 0.971 in 81 Large White pigs (~17× sequencing coverage). We also applied our reference panel to impute low-density SNP chip data to high density for three GWASs and found novel significantly associated SNPs that might be causal variants.
|
40
|
Repurposing clinically approved drugs as Wee1 checkpoint kinase inhibitors: an in silico investigation integrating molecular docking, ensemble QSAR modelling and molecular dynamics simulation. MOLECULAR SIMULATION 2022. [DOI: 10.1080/08927022.2022.2101673] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/17/2022]
|
41
|
Automatic question answering for multiple stakeholders, the epidemic question answering dataset. Sci Data 2022; 9:432. [PMID: 35864125 PMCID: PMC9302224 DOI: 10.1038/s41597-022-01533-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2022] [Accepted: 07/05/2022] [Indexed: 11/09/2022] Open
Abstract
One of the effects of the COVID-19 pandemic is a rapidly growing and changing stream of publications to inform clinicians, researchers, policy makers, and patients about the health, socio-economic, and cultural consequences of the pandemic. Managing this information stream manually is not feasible. Automatic Question Answering can quickly bring the most salient points to the user's attention. Leveraging a collection of scientific articles, government websites, relevant news articles, curated social media posts, and questions asked by researchers, clinicians, and the general public, we developed a dataset to explore automatic Question Answering for multiple stakeholders. Analysis of questions asked by various stakeholders shows that while the information needs of experts and the public may overlap, satisfactory answers to these questions often originate from different information sources or benefit from different approaches to answer generation. We believe that this dataset has the potential to support the development of question answering systems not only for epidemic questions, but also for other domains with varying expertise, such as law or finance.
|
42
|
DriverRWH: discovering cancer driver genes by random walk on a gene mutation hypergraph. BMC Bioinformatics 2022; 23:277. [PMID: 35831792 PMCID: PMC9281118 DOI: 10.1186/s12859-022-04788-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2021] [Accepted: 06/08/2022] [Indexed: 12/24/2022] Open
Abstract
Background Recent advances in next-generation sequencing technologies have helped investigators generate massive amounts of cancer genomic data. A critical challenge in cancer genomics is the identification of the few cancer driver genes whose mutations cause tumor growth. However, the majority of existing computational approaches underuse the co-occurrence mutation information of individuals, which is deemed important in tumorigenesis and tumor progression, resulting in a high rate of false positives. Results To make full use of co-mutation information, we present a random walk algorithm, referred to as DriverRWH, on a weighted gene mutation hypergraph model, using somatic mutation data and molecular interaction network data to prioritize candidate driver genes. Applied to tumor samples of different cancer types from The Cancer Genome Atlas, DriverRWH shows significantly better performance than state-of-the-art prioritization methods in terms of the area under the curve scores and the cumulative number of known driver genes recovered in top-ranked candidate genes. In addition, DriverRWH discovers several potential drivers, which are enriched in cancer-related pathways. DriverRWH recovers approximately 50% known driver genes in the top 30 ranked candidate genes for more than half of the cancer types. DriverRWH is also highly robust to perturbations in the mutation data and gene functional network data. Conclusion DriverRWH is effective across various cancer types in prioritizing cancer driver genes and provides considerable improvement over other tools, with a better balance of precision and sensitivity. It can be a useful tool for detecting potential driver genes and facilitating targeted cancer therapies. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04788-7.
|
43
|
The clove (Syzygium aromaticum) genome provides insights into the eugenol biosynthesis pathway. Commun Biol 2022; 5:684. [PMID: 35810198 PMCID: PMC9271057 DOI: 10.1038/s42003-022-03618-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2020] [Accepted: 06/22/2022] [Indexed: 11/09/2022] Open
Abstract
The clove (Syzygium aromaticum) is an important tropical spice crop in global trade. Evolving environmental pressures necessitate modern characterization and selection techniques that are currently inaccessible to clove growers owing to the scarcity of genomic and genetic information. Here, we present a 370-Mb high-quality chromosome-scale genome assembly for clove. Comparative genomic analysis between S. aromaticum and Eucalyptus grandis—both species of the Myrtaceae family—reveals good genome structure conservation and intrachromosomal rearrangements on seven of the eleven chromosomes. We report genes that belong to families involved in the biosynthesis of eugenol, the major bioactive component of clove products. On the basis of our transcriptomic and metabolomic findings, we propose a hypothetical scenario in which eugenol acetate plays a key role in high eugenol accumulation in clove leaves and buds. The clove genome is a new contribution to omics resources for the Myrtaceae family and an important tool for clove research. A newly assembled clove genome is compared with E. grandis to investigate genome evolution between the two genera of the Myrtaceae family, and putative genes involved in the biosynthesis of eugenol are identified through transcriptomics and metabolomics.
|
44
|
Exploring deep learning methods for recognizing rare diseases and their clinical manifestations from texts. BMC Bioinformatics 2022; 23:263. [PMID: 35794528 PMCID: PMC9258216 DOI: 10.1186/s12859-022-04810-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2022] [Accepted: 06/21/2022] [Indexed: 11/10/2022] Open
Abstract
Background and objective
Although rare diseases are characterized by low prevalence, approximately 400 million people are affected by a rare disease. The early and accurate diagnosis of these conditions is a major challenge for general practitioners, who do not have enough knowledge to identify them. In addition to this, rare diseases usually show a wide variety of manifestations, which might make the diagnosis even more difficult. A delayed diagnosis can negatively affect the patient’s life. Therefore, there is an urgent need to increase the scientific and medical knowledge about rare diseases. Natural Language Processing (NLP) and Deep Learning can help to extract relevant information about rare diseases to facilitate their diagnosis and treatments.
Methods
The paper explores several deep learning techniques such as Bidirectional Long Short Term Memory (BiLSTM) networks or deep contextualized word representations based on Bidirectional Encoder Representations from Transformers (BERT) to recognize rare diseases and their clinical manifestations (signs and symptoms).
Results
BioBERT, a domain-specific language representation based on BERT and trained on biomedical corpora, obtains the best results with an F1 of 85.2% for rare diseases. Since many signs are usually described by complex noun phrases that involve overlapped, nested and discontinuous entities, the model provides lower results for signs, with an F1 of 57.2%.
Conclusions
While our results are promising, there is still much room for improvement, especially with respect to the identification of clinical manifestations (signs and symptoms).
|
45
|
Knowledge Graph and Deep Learning-based Text-to-GQL Model for Intelligent Medical Consultation Chatbot. INFORMATION SYSTEMS FRONTIERS : A JOURNAL OF RESEARCH AND INNOVATION 2022; 26:1-19. [PMID: 35815295 PMCID: PMC9257573 DOI: 10.1007/s10796-022-10295-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Accepted: 05/10/2022] [Indexed: 05/05/2023]
Abstract
Text-to-GQL (Text2GQL) is the task of converting a user's question into GQL (Graph Query Language) for a given graph database. It is a semantic parsing task that transforms natural-language questions into logical expressions, enabling more efficient and direct communication between humans and machines. Existing related work focuses mainly on Text-to-SQL tasks, and no semantic parsing method or dataset is available for graph databases. To fill this gap and better serve medical Human-Robot Interaction (HRI), we propose the Text2GQL task and a pipeline solution for it. The solution uses an Adapter pre-trained on "the linking of GQL schemas and the corresponding utterances" as a plug-in for introducing external knowledge. By inserting the Adapter into the language model, the mapping between logical language and natural language can be introduced faster and more directly, better realizing end-to-end human-machine language translation. The proposed Text2GQL model is built on an improved pipeline composed of a Language Model, a Pre-trained Adapter plug-in, and a Pointer Network. This enables the model to copy object tokens from utterances and generate the corresponding GQL statements for graph database retrieval, and it includes an adjustment mechanism to improve the final output. Experiments show that the proposed method is competitive on the counterpart datasets (Spider, ATIS, GeoQuery, and 39.net) converted from the Text2SQL task and that it is also practical in medical scenarios.
|
46
|
CancerNet: a unified deep learning network for pan-cancer diagnostics. BMC Bioinformatics 2022; 23:229. [PMID: 35698059 PMCID: PMC9195411 DOI: 10.1186/s12859-022-04783-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2022] [Accepted: 06/06/2022] [Indexed: 11/10/2022] Open
Abstract
Background Despite remarkable advances in cancer research, cancer remains one of the leading causes of death worldwide. Early detection of cancer and localization of the tissue of its origin are key to effective treatment. Here, we leverage technological advances in machine learning or artificial intelligence to design a novel framework for cancer diagnostics. Our proposed framework detects cancers and their tissues of origin using a unified model of cancers encompassing 33 cancers represented in The Cancer Genome Atlas (TCGA). Our model exploits the learned features of different cancers reflected in the respective dysregulated epigenomes, which arise early in carcinogenesis and differ remarkably between different cancer types or subtypes, thus holding a great promise in early cancer detection. Results Our comprehensive assessment of the proposed model on the 33 different tissues of origin demonstrates its ability to detect and classify cancers to a high accuracy (> 99% overall F-measure). Furthermore, our model distinguishes cancers from pre-cancerous lesions to metastatic tumors and discriminates between hypomethylation changes due to age related epigenetic drift and true cancer. Conclusions Beyond detection of primary cancers, our proposed computational model also robustly detects tissues of origin of secondary cancers, including metastatic cancers, second primary cancers, and cancers of unknown primaries. Our assessment revealed the ability of this model to characterize pre-cancer samples, a significant step forward in early cancer detection. Deployed broadly this model can deliver accurate diagnosis for a greatly expanded target patient population. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04783-y.
|
47
|
Salinity tolerance mechanisms of an Arctic Pelagophyte using comparative transcriptomic and gene expression analysis. Commun Biol 2022; 5:500. [PMID: 35614207 PMCID: PMC9133084 DOI: 10.1038/s42003-022-03461-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2021] [Accepted: 05/09/2022] [Indexed: 11/09/2022] Open
Abstract
Little is known at the transcriptional level about microbial eukaryotic adaptations to short-term salinity change. Arctic microalgae are exposed to low salinity due to sea-ice melt and to higher salinity with brine channel formation during freeze-up. Here, we investigate the transcriptional response of an ice-associated microalga over salinities from 45 to 8. Our results show a bracketed response of differential gene expression when the cultures were exposed to progressively decreasing salinity. Key genes associated with salinity changes were involved in specific metabolic pathways, transcription factors and regulators, protein kinases, carbohydrate active enzymes, and inorganic ion transporters. The pelagophyte seemed to use a strategy involving overexpression of Na+-H+ antiporters and Na+-Pi symporters as salinity decreases, and of the K+ channel complex at higher salinities. Specific adaptation to cold saline Arctic conditions was seen with differential expression of several antifreeze proteins, an ice-binding protein and an acyl-esterase involved in cold adaptation.
|
48
|
Microcalcification Discrimination in Mammography Using Deep Convolutional Neural Network: Towards Rapid and Early Breast Cancer Diagnosis. Front Public Health 2022; 10:875305. [PMID: 35570962 PMCID: PMC9096221 DOI: 10.3389/fpubh.2022.875305] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2022] [Accepted: 04/04/2022] [Indexed: 11/30/2022] Open
Abstract
Breast cancer is among the most common types of cancer in women, and the mortality risk is high when cases are misdiagnosed or treatment is delayed. Breast microcalcifications are common in breast cancer patients and are an effective indicator of early breast cancer. However, microcalcifications are often missed or wrongly classified during screening due to their small size and indirect scattering in mammogram images. Motivated by this issue, this project proposes an adaptive transfer learning deep convolutional neural network for segmenting breast mammogram images with calcifications, to support early breast cancer diagnosis and intervention. Mammogram images of breast microcalcifications are used to train several deep neural network models, and their performance is compared. Region-of-interest images were filtered to remove possible artifacts and noise and to enhance image quality before training. Hyperparameters such as the number of epochs and the batch size were tuned to obtain the best possible result. In addition, the performance of the proposed fine-tuned ResNet50 is compared with that of other state-of-the-art networks, namely ResNet34, VGG16, and AlexNet. Confusion matrices were used for comparison. The results show that the proposed ResNet50 achieves the highest accuracy, 97.58%, followed by ResNet34 at 97.35%, VGG16 at 96.97%, and AlexNet at 83.06%.
|
49
|
MAGCNSE: predicting lncRNA-disease associations using multi-view attention graph convolutional network and stacking ensemble model. BMC Bioinformatics 2022; 23:189. [PMID: 35590258 PMCID: PMC9118755 DOI: 10.1186/s12859-022-04715-w] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2022] [Accepted: 05/05/2022] [Indexed: 01/02/2023] Open
Abstract
Background Many long non-coding RNAs (lncRNAs) have key roles in different human biologic processes and are closely linked to numerous human diseases, according to cumulative evidence. Predicting potential lncRNA-disease associations can help to detect disease biomarkers and perform disease analysis and prevention. Establishing effective computational methods for lncRNA-disease association prediction is critical.
Results In this paper, we propose a novel model named MAGCNSE to predict underlying lncRNA-disease associations. We first obtain multiple feature matrices from the multi-view similarity graphs of lncRNAs and diseases using a graph convolutional network. Then, weights are adaptively assigned to the different feature matrices of lncRNAs and diseases using the attention mechanism. Next, the final representations of lncRNAs and diseases are acquired by further extracting features from the multi-channel feature matrices using a convolutional neural network. Finally, we employ a stacking ensemble classifier, consisting of multiple traditional machine learning classifiers, to make the final prediction. The results of ablation studies on both the representation learning methods and the classification methods demonstrate the validity of each module. Furthermore, we compare the overall performance of MAGCNSE with that of six other state-of-the-art models; the results show that it outperforms the other methods. Moreover, we verify the effectiveness of using multi-view data of lncRNAs and diseases. Case studies further reveal the outstanding ability of MAGCNSE in the identification of potential lncRNA-disease associations.
Conclusions The experimental results indicate that MAGCNSE is a useful approach for predicting potential lncRNA-disease associations. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04715-w.
|
50
|
A global Anopheles gambiae gene co-expression network constructed from hundreds of experimental conditions with missing values. BMC Bioinformatics 2022; 23:170. [PMID: 35534830 PMCID: PMC9082846 DOI: 10.1186/s12859-022-04697-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2022] [Accepted: 04/25/2022] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND Gene co-expression networks (GCNs) can be used to determine gene regulation and attribute gene function to biological processes. Different high throughput technologies, including one and two-channel microarrays and RNA-sequencing, allow evaluating thousands of gene expression data simultaneously, but these methodologies provide results that cannot be directly compared. Thus, it is complex to analyze co-expression relations between genes, especially when there are missing values arising for experimental reasons. Networks are a helpful tool for studying gene co-expression, where nodes represent genes and edges represent co-expression of pairs of genes. RESULTS In this paper, we establish a method for constructing a gene co-expression network for the Anopheles gambiae transcriptome from 257 unique studies obtained with different methodologies and experimental designs. We introduce the sliding threshold approach to select node pairs with high Pearson correlation coefficients. The resulting network, which we name AgGCN1.0, is robust to random removal of conditions and has similar characteristics to small-world and scale-free networks. Analysis of network sub-graphs revealed that the core is largely comprised of genes that encode components of the mitochondrial respiratory chain and the ribosome, while different communities are enriched for genes involved in distinct biological processes. CONCLUSION Analysis of the network reveals that both the architecture of the core sub-network and the network communities are based on gene function, supporting the power of the proposed method for GCN construction. Application of network science methodology reveals that the overall network structure is driven to maximize the integration of essential cellular functions, possibly allowing the flexibility to add novel functions.
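The general idea of building a co-expression network from Pearson correlations when some measurements are missing can be sketched as below. This is only an illustration under stated assumptions: it uses a single fixed cutoff rather than the sliding threshold approach the abstract names (whose details are not given here), it computes each correlation over the conditions where both genes were measured, and the names `pairwise_pearson`, `coexpression_edges`, `threshold`, and `min_overlap` as well as the toy matrix are hypothetical.

```python
import numpy as np


def pairwise_pearson(x, y, min_overlap=3):
    """Pearson r over the conditions where both genes are observed (non-NaN)."""
    mask = ~np.isnan(x) & ~np.isnan(y)
    if mask.sum() < min_overlap:
        return np.nan  # too few shared conditions to estimate a correlation
    xs, ys = x[mask], y[mask]
    if xs.std() == 0 or ys.std() == 0:
        return np.nan  # correlation is undefined for a constant profile
    return float(np.corrcoef(xs, ys)[0, 1])


def coexpression_edges(expr, threshold=0.9, min_overlap=3):
    """Return (gene_i, gene_j, r) for gene pairs with r >= threshold.

    expr is a genes x conditions array; NaN marks a missing value.
    A fixed cutoff is an assumption made for this sketch.
    """
    edges = []
    n_genes = expr.shape[0]
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            r = pairwise_pearson(expr[i], expr[j], min_overlap)
            if not np.isnan(r) and r >= threshold:
                edges.append((i, j, r))
    return edges


# Toy expression matrix: 3 genes x 5 conditions, with two missing values.
expr = np.array([
    [1.0, 2.0, 3.0, np.nan, 5.0],
    [2.0, 4.1, 6.0, 8.0, np.nan],
    [5.0, 1.0, 4.0, 2.0, 3.0],
])
edges = coexpression_edges(expr, threshold=0.9)
print(edges)
```

In this toy matrix only genes 0 and 1 rise together across their shared conditions, so they form the single edge; gene 2 is negatively correlated with both and stays unconnected.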
|