1
|
Scoring alignments by embedding vector similarity. Brief Bioinform 2024; 25:bbae178. [PMID: 38695119 PMCID: PMC11063651 DOI: 10.1093/bib/bbae178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Revised: 03/20/2024] [Accepted: 03/31/2024] [Indexed: 05/05/2024] Open
Abstract
Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness by being context dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning to leverage enormous amounts of unlabelled data and generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially the ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.
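The E-score described in this abstract is a cosine similarity between two residue embedding vectors. The following is a minimal sketch of that formula only, not the authors' implementation: the function name `e_score` and the three-dimensional toy vectors are assumptions for illustration, whereas real embeddings from a protein language model such as ProtT5 are much longer.

```python
import math


def e_score(u, v):
    """Cosine similarity between two embedding vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for residue embeddings (made up for this example).
emb_a = [1.0, 2.0, 0.0]
emb_b = [2.0, 4.0, 0.0]  # same direction as emb_a, so the score is close to 1
emb_c = [0.0, 0.0, 3.0]  # orthogonal to emb_a, so the score is 0

print(e_score(emb_a, emb_b))
print(e_score(emb_a, emb_c))
```

Because the score depends on the embedding of each residue in its sequence context, the same pair of amino acids can receive different scores at different positions, which is the contrast with a fixed matrix such as BLOSUM that the abstract draws.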
|
2
|
A Chromosome-Level Genome Assembly of the Non-Hematophagous Leech Whitmania pigra (Whitman 1884): Identification and Expression Analysis of Antithrombotic Genes. Genes (Basel) 2024; 15:164. [PMID: 38397154 PMCID: PMC10887747 DOI: 10.3390/genes15020164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2023] [Revised: 01/19/2024] [Accepted: 01/24/2024] [Indexed: 02/25/2024] Open
Abstract
Despite being a non-hematophagous leech, Whitmania pigra is widely used in traditional Chinese medicine for the treatment of thrombotic diseases. In this study, we provide a high-quality genome of W. pigra, based on which we performed a systematic identification of the potential antithrombotic genes and their corresponding proteins. We identified twenty antithrombotic gene families, including thirteen coagulation inhibitors, three platelet aggregation inhibitors, three fibrinolysis enhancers, and one tissue penetration enhancer. Unexpectedly, a total of 79 antithrombotic genes were identified, more than in the typical blood-feeding leech Hirudinaria manillensis, which had only 72 antithrombotic genes. In addition, using RNA-seq data from W. pigra and H. manillensis, we calculated the expression levels of antithrombotic genes in the two species. Five gene families had significantly higher expression levels in W. pigra than in H. manillensis, and four had significantly lower levels. These results showed that the number and expression level of antithrombotic genes of a non-hematophagous leech are not always lower than those of a hematophagous leech. Our study provides the most comprehensive collection of antithrombotic biomacromolecules from a non-hematophagous leech to date and will significantly enhance the investigation and utilization of leech derivatives in thrombosis therapy research and pharmaceutical applications.
|
3
|
GSLAlign: community detection and local PPI network alignment. J Biomol Struct Dyn 2024:1-9. [PMID: 38214492 DOI: 10.1080/07391102.2024.2301757] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2023] [Accepted: 12/29/2023] [Indexed: 01/13/2024]
Abstract
High-throughput protein-protein interaction (PPI) profiling and computational techniques have generated a large amount of PPI network data. The study of PPI networks helps in understanding the biological processes of the proteins. The comparative study of PPI networks helps in identifying interactions that are conserved across species. This article presents a novel local PPI network aligner, 'GSLAlign', that consists of two stages. It first detects the communities in the PPI networks by applying the GraphSAGE algorithm using gene expression data. In the second stage, the detected communities are aligned using a community aligner that is based on protein sequence similarity. The community detection algorithm produces more separable and biologically accurate communities than previous community detection algorithms. Moreover, the proposed community alignment algorithm achieves 3-8% better results in terms of semantic similarity than previous local aligners. The average connectivity and coverage of the proposed algorithm are also better than those of the existing aligners. Communicated by Ramaswamy H. Sarma.
|
4
|
Proteinortho6: pseudo-reciprocal best alignment heuristic for graph-based detection of (co-)orthologs. Front Bioinform 2023; 3:1322477. [PMID: 38152702 PMCID: PMC10751348 DOI: 10.3389/fbinf.2023.1322477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Accepted: 11/06/2023] [Indexed: 12/29/2023] Open
Abstract
Proteinortho is a widely used tool to predict (co)-orthologous groups of genes for any set of species. It finds application in comparative and functional genomics, phylogenomics, and evolutionary reconstructions. With a rapidly increasing number of available genomes, the demand for large-scale predictions is also growing. In this contribution, we evaluate and implement major algorithmic improvements that significantly enhance the speed of the analysis without reducing precision. Graph-based detection of (co-)orthologs is typically based on a reciprocal best alignment heuristic that requires an all vs. all comparison of proteins from all species under study. The initial identification of similar proteins is accelerated by introducing an alternative search tool along with a revised search strategy-the pseudo-reciprocal best alignment heuristic-that reduces the number of required sequence comparisons by one-half. The clustering algorithm was reworked to efficiently decompose very large clusters and accelerate processing. Proteinortho6 reduces the overall processing time by an order of magnitude compared to its predecessor while maintaining its small memory footprint and good predictive quality.
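The abstract refers to the reciprocal best alignment heuristic that graph-based ortholog detection typically starts from. As a rough sketch of the classic strict form of that idea (not the pseudo-reciprocal variant and not Proteinortho's code), two proteins are paired when each is the other's best-scoring hit. The function names, protein identifiers, and scores below are all hypothetical, and ties are resolved by keeping the first hit seen.

```python
def best_hits(scores):
    """Map each query to its highest-scoring target.

    scores: dict mapping (query, target) -> alignment score.
    Ties keep the first target encountered (an arbitrary choice for this sketch).
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Made-up alignment scores between proteins of species A and species B.
a_vs_b = {("a1", "b1"): 90, ("a1", "b2"): 40, ("a2", "b2"): 70, ("a3", "b2"): 75}
b_vs_a = {("b1", "a1"): 88, ("b2", "a3"): 74, ("b2", "a2"): 60}

print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1'), ('a3', 'b2')]
```

This strict version needs searches in both directions for every species pair, which is the all vs. all cost the abstract says the pseudo-reciprocal heuristic cuts by one-half.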
|
5
|
SAMNA: accurate alignment of multiple biological networks based on simulated annealing. J Integr Bioinform 2023; 20:jib-2023-0006. [PMID: 38097366 PMCID: PMC10777366 DOI: 10.1515/jib-2023-0006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Accepted: 08/27/2023] [Indexed: 01/11/2024] Open
Abstract
Proteins are important parts of biological structures and encode a lot of biological information. Protein-protein interaction (PPI) network alignment is a model for analyzing proteins that helps discover conserved functions between organisms and predict unknown functions. In particular, multi-network alignment aims at finding the mapping relationship among the nodes of multiple networks, so as to transfer knowledge across species. However, with the increasing complexity of PPI networks, performing network alignment more accurately and efficiently has become a new challenge. This paper proposes a new global network alignment algorithm called Simulated Annealing Multiple Network Alignment (SAMNA), which uses both network topology and sequence homology information. To generate the alignment, SAMNA first generates cross-network candidate clusters by running a clustering algorithm on a k-partite similarity graph constructed from sequence similarity information, and then selects candidate cluster nodes as alignment results and optimizes them using an improved simulated annealing algorithm. Finally, SAMNA was evaluated on synthetic and real-world network datasets, and the results showed that it outperformed the state-of-the-art algorithm in biological performance.
|
6
|
Identification of Two Flip-Over Genes in Grass Family as Potential Signature of C4 Photosynthesis Evolution. Int J Mol Sci 2023; 24:14165. [PMID: 37762466 PMCID: PMC10531853 DOI: 10.3390/ijms241814165] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2023] [Revised: 09/05/2023] [Accepted: 09/13/2023] [Indexed: 09/29/2023] Open
Abstract
In flowering plants, C4 photosynthesis is superior to C3 type in carbon fixation efficiency and adaptation to extreme environmental conditions, but the mechanisms behind the assembly of C4 machinery remain elusive. This study attempts to dissect the evolutionary divergence from C3 to C4 photosynthesis in five photosynthetic model plants from the grass family, using a combined comparative transcriptomics and deep learning technology. By examining and comparing gene expression levels in bundle sheath and mesophyll cells of five model plants, we identified 16 differentially expressed signature genes showing cell-specific expression patterns in C3 and C4 plants. Among them, two showed distinctively opposite cell-specific expression patterns in C3 vs. C4 plants (named as FOGs). The in silico physicochemical analysis of the two FOGs illustrated that C3 homologous proteins of LHCA6 had low and stable pI values of ~6, while the pI values of LHCA6 homologs increased drastically in C4 plants Setaria viridis (7), Zea mays (8), and Sorghum bicolor (over 9), suggesting this protein may have different functions in C3 and C4 plants. Interestingly, based on pairwise protein sequence/structure similarities between each homologous FOG protein, one FOG PGRL1A showed local inconsistency between sequence similarity and structure similarity. To find more examples of the evolutionary characteristics of FOG proteins, we investigated the protein sequence/structure similarities of other FOGs (transcription factors) and found that FOG proteins have diversified incompatibility between sequence and structure similarities during grass family evolution. This raised an interesting question as to whether the sequence similarity is related to structure similarity during C4 photosynthesis evolution.
|
7
|
Evaluation of Five Mammalian Models for Human Disease Research Using Genomic and Bioinformatic Approaches. Biomedicines 2023; 11:2197. [PMID: 37626695 PMCID: PMC10452283 DOI: 10.3390/biomedicines11082197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2023] [Revised: 07/27/2023] [Accepted: 07/28/2023] [Indexed: 08/27/2023] Open
Abstract
The suitability of an animal model for use in studying human diseases relies heavily on the similarities between the two species at the genetic, epigenetic, and metabolic levels. However, there is a lack of consistent data from different animal models at each level to evaluate this suitability. With the availability of genome sequences for many mammalian species, it is now possible to compare animal models based on genomic similarities. Herein, we compare the coding sequences (CDSs) of five mammalian models, including rhesus macaque, marmoset, pig, mouse, and rat models, with human coding sequences. We identified 10,316 conserved CDSs across the five organisms and the human genome based on sequence similarity. Mapping the human-disease-associated single-nucleotide polymorphisms (SNPs) from these conserved CDSs in each species has identified species-specific associations with various human diseases. While associations with a disease such as colon cancer were prevalent in multiple model species, the rhesus macaque showed the most model-specific human disease associations. Based on the percentage of disease-associated SNP-containing genes, marmoset models are well suited to study many human ailments, including behavioral and cardiovascular diseases. This study demonstrates a genomic similarity evaluation of five animal models against human CDSs that could help investigators select a suitable animal model for studying their target disease.
|
8
|
SARS-CoV-2 Gut-Targeted Epitopes: Sequence Similarity and Cross-Reactivity Join Together for Molecular Mimicry. Biomedicines 2023; 11:1937. [PMID: 37509576 PMCID: PMC10376948 DOI: 10.3390/biomedicines11071937] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 07/02/2023] [Accepted: 07/06/2023] [Indexed: 07/30/2023] Open
Abstract
The gastrointestinal tract can be heavily infected by SARS-CoV-2. Being an auto-immunogenic virus, SARS-CoV-2 represents an environmental factor that might play a role in gut-associated autoimmune diseases. However, molecular mimicry between the virus and the intestinal epitopes is under-investigated. The present study aims to elucidate sequence similarity between viral antigens and human enteric sequences, based on known cross-reactivity. SARS-CoV-2 epitopes that cross-react with human gut antigens were explored, and sequence alignment was performed against self-antigens implicated in enteric autoimmune conditions. Experimental SARS-CoV-2 epitopes were aggregated from the Immune Epitope Database (IEDB), while enteric antigens were obtained from the UniProt Knowledgebase. A Pairwise Local Alignment tool, EMBOSS Matcher, was employed for the similarity search. Sequence similarity and targeted cross-reactivity were depicted between 10 pairs of immunoreactive epitopes. Similar pairs were found in four viral proteins and seven enteric antigens related to ulcerative colitis, primary biliary cholangitis, celiac disease, and autoimmune hepatitis. Antibodies made against the viral proteins that were cross-reactive with human gut antigens are involved in several essential cellular functions. The relationship and contribution of those intestinal cross-reactive epitopes to SARS-CoV-2 or its potential contribution to gut auto-immuno-genesis are discussed.
|
9
|
Molecular Structure and Variation Characteristics of the Plastomes from Six Malus baccata (L.) Borkh. Individuals and Comparative Genomic Analysis with Other Malus Species. Biomolecules 2023; 13:962. [PMID: 37371542 DOI: 10.3390/biom13060962] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2023] [Revised: 06/01/2023] [Accepted: 06/05/2023] [Indexed: 06/29/2023] Open
Abstract
Malus baccata (L.) Borkh. is an important wild species of Malus. Its rich variation types and population history are not well understood. Chloroplast genome mining plays an active role in germplasm identification and genetic evolution. In this study, by assembly and annotation, six complete cp genome sequences, ranging in size from 160,083 to 160,295 bp, were obtained. The GC content of stable IR regions (42.7%) was significantly higher than that of full length (36.5%) and SC regions (LSC-34.2%, SSC-30.4%). Compared with other Malus species, it was found that there were more sites of polymorphisms and hotspots of variation in LSC and SSC regions, with high variation sites including trnR/UCU-atpA, trnT/UGU-trnL/UAA, ndhF-rpl32 and ccsA-ndhD. The intraspecific and interspecific collinearity was good, and no structural rearrangement was observed. A large number of repeating elements and different boundary expansions may be involved in shaping the cp genome size. Up to 77 or 78 coding genes were annotated in the cp genomes of M. baccata, and high frequency codons such as UUA (Leu), GCU (Ala) and AGA (Arg) were identified by relative synonymous codon usage analysis. Phylogeographic analysis showed that 12 individuals of M. baccata clustered into three different groups with complex structure, whereas variant xiaojinensis (M.H. Cheng & N.G. Jiang) was not closely related to M. baccata evolutionarily. The phylogenetic analysis suggested that two main clades of different M. baccata in the genus Malus were formed and that I and II diverged about 9.7 MYA. In conclusion, through cp genome assembly and comparison, the interspecific relationships and molecular variations of M. baccata were further elucidated, and the results of this study provide valuable information for the phylogenetic evolution and germplasm conservation of M. baccata and Malus.
|
10
|
Improved T cell receptor antigen pairing through data-driven filtering of sequencing information from single cells. eLife 2023; 12:e81810. [PMID: 37133356 PMCID: PMC10156162 DOI: 10.7554/elife.81810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Accepted: 03/13/2023] [Indexed: 05/04/2023] Open
Abstract
Novel single-cell-based technologies hold the promise of matching T cell receptor (TCR) sequences with their cognate peptide-MHC recognition motif in a high-throughput manner. Parallel capture of TCR transcripts and peptide-MHC is enabled through the use of reagents labeled with DNA barcodes. However, analysis and annotation of such single-cell sequencing (SCseq) data are challenged by dropout, random noise, and other technical artifacts that must be carefully handled in the downstream processing steps. We here propose a rational, data-driven method termed ITRAP (improved T cell Receptor Antigen Pairing) to deal with these challenges. It filters away likely artifacts and enables the generation of large sets of TCR-pMHC sequence data with a high degree of specificity and sensitivity, thus outputting the most likely pMHC target per T cell. We have validated this approach across 10 different virus-specific T cell responses in 16 healthy donors. Across these samples, we have identified up to 1494 high-confidence TCR-pMHC pairs derived from 4135 single cells.
|
11
|
PepSim: T-cell cross-reactivity prediction via comparison of peptide sequence and peptide-HLA structure. Front Immunol 2023; 14:1108303. [PMID: 37187737 PMCID: PMC10175663 DOI: 10.3389/fimmu.2023.1108303] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2022] [Accepted: 04/12/2023] [Indexed: 05/17/2023] Open
Abstract
Introduction: Peptide-HLA class I (pHLA) complexes on the surface of tumor cells can be targeted by cytotoxic T-cells to eliminate tumors, and this is one of the bases for T-cell-based immunotherapies. However, there exist cases where therapeutic T-cells directed towards tumor pHLA complexes may also recognize pHLAs from healthy normal cells. The process where the same T-cell clone recognizes more than one pHLA is referred to as T-cell cross-reactivity and this process is driven mainly by features that make pHLAs similar to each other. T-cell cross-reactivity prediction is critical for designing T-cell-based cancer immunotherapies that are both effective and safe. Methods: Here we present PepSim, a novel score to predict T-cell cross-reactivity based on the structural and biochemical similarity of pHLAs. Results and discussion: We show our method can accurately separate cross-reactive from non-cross-reactive pHLAs in a diverse set of datasets including cancer, viral, and self-peptides. PepSim can be generalized to work on any dataset of class I peptide-HLAs and is freely available as a web server at pepsim.kavrakilab.org.
|
12
|
Poincaré maps for visualization of large protein families. Brief Bioinform 2023; 24:7083418. [PMID: 36946414 DOI: 10.1093/bib/bbad103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2022] [Revised: 02/09/2023] [Accepted: 02/28/2023] [Indexed: 03/23/2023] Open
Abstract
In the era of constantly increasing amounts of the available protein data, a relevant and interpretable visualization becomes crucial, especially for tasks requiring human expertise. Poincaré disk projection has previously demonstrated its important efficiency for visualization of biological data such as single-cell RNAseq data. Here, we develop a new method PoincaréMSA for visual representation of complex relationships between protein sequences based on Poincaré maps embedding. We demonstrate its efficiency and potential for visualization of protein family topology as well as evolutionary and functional annotation of uncharacterized sequences. PoincaréMSA is implemented in open source Python code with available interactive Google Colab notebooks as described at https://www.dsimb.inserm.fr/POINCARE_MSA.
|
13
|
Crosstalk Between the Immune System and Plant-Derived Nanovesicles: A Study of Allergen Transporting. Front Bioeng Biotechnol 2021; 9:760730. [PMID: 34900959 PMCID: PMC8662998 DOI: 10.3389/fbioe.2021.760730] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 08/18/2021] [Accepted: 10/05/2021] [Indexed: 12/12/2022] Open
Abstract
Background: Nanometer-sized membrane-surrounded vesicles from different parts of plants including fruits are gaining increasing attention due to their anti-inflammatory and anticancer effects demonstrated by in vitro and in vivo studies, and as nanovectors for molecular delivery of exogenous substances. These nanomaterials are very complex and contain a diverse arsenal of bioactive molecules, such as nucleic acids, proteins, and lipids. Our knowledge about the transport of allergens in vesicles isolated from plant food is limited today. Methods: Here, to investigate the allergenicity of strawberry-derived microvesicles (MVs), nanovesicles (NVs), and subpopulations of NV, we have set up a multidisciplinary approach. The strategy combines proteomics-based protein identification, immunological investigations, bioinformatics, and data mining to gain biological insights useful to evaluate the presence of potential allergens and the immunoglobulin E (IgE) inhibitory activity of vesicle preparations. Results: Immunological tests showed that several proteins of strawberry-derived vesicles compete for IgE binding with allergens spotted on the FABER biochip. This includes the known strawberry allergens Fra a 1, Fra a 3, and Fra a 4, and also other IgE-binding proteins not yet described as allergens in this food, such as gibberellin-regulated proteins, 2S albumin, pectate lyase, and trypsin inhibitors. Proteomics identified homologous sequences of the three strawberry allergens and their isoforms in total protein extract (TPE) but only Fra a 1 and Fra a 4 in the vesicle samples. Label-free quantitative proteomic analysis revealed no significant enrichment of these proteins in strawberry vesicles with respect to TPE. Conclusion: Immunological tests and bioinformatics analysis of proteomics data sets revealed that MVs and NVs isolated from strawberries can carry functional allergens and their isoforms, as well as proteins that are potentially allergenic based on their structural features. This should be considered when these new nanomaterials are used for human nutraceutical or biomedical applications.
|
14
|
Molecular and Structural Parallels between Gluten Pathogenic Peptides and Bacterial-Derived Proteins by Bioinformatics Analysis. Int J Mol Sci 2021; 22:9278. [PMID: 34502187 PMCID: PMC8430993 DOI: 10.3390/ijms22179278] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/12/2021] [Revised: 08/23/2021] [Accepted: 08/25/2021] [Indexed: 02/08/2023] Open
Abstract
Gluten-related disorders (GRDs) are a group of diseases that involve the activation of the immune system triggered by the ingestion of gluten, with a worldwide prevalence of 5%. Among them, Celiac disease (CeD) is a T-cell-mediated autoimmune disease causing a plethora of symptoms from diarrhea and malabsorption to lymphoma. Even though GRDs have been intensively studied, the environmental triggers promoting the diverse reactions to gluten proteins in susceptible individuals remain elusive. It has been proposed that pathogens could act as disease-causing environmental triggers of CeD by molecular mimicry mechanisms. Additionally, it could also be possible that unrecognized molecular, structural, and physical parallels between gluten and pathogens have a relevant role. Herein, we report sequence, structural and physical similarities of the two most relevant gluten peptides, the 33-mer and p31-43 gliadin peptides, with bacterial pathogens using bioinformatics going beyond the molecular mimicry hypothesis. First, a stringent BLASTp search using the two gliadin peptides identified high sequence similarity regions within pathogen-derived proteins, e.g., extracellular proteins from Streptococcus pneumoniae and Granulicatella sp. Second, molecular dynamics calculations of an updated α-2-gliadin model revealed close spatial localization and solvent-exposure of the 33-mer and p31-43 peptide, which was compared with the pathogen-related proteins by homology models and localization predictors. We found putative functions of the identified pathogen-derived sequence by identifying T-cell epitopes and SH3/WW-binding domains. Finally, shape and size parallels between the pathogens and the superstructures of gliadin peptides gave rise to novel hypotheses about activation of innate immunity and dysbiosis. Based on our structural findings and the similarities with the bacterial pathogens, evidence emerges that these pathologically relevant gluten-derived peptides could behave as non-replicating pathogens opening new research questions in the interface of innate immunity, microbiome, and food research.
|
15
|
TCRMatch: Predicting T-Cell Receptor Specificity Based on Sequence Similarity to Previously Characterized Receptors. Front Immunol 2021; 12:640725. [PMID: 33777034 PMCID: PMC7991084 DOI: 10.3389/fimmu.2021.640725] [Citation(s) in RCA: 42] [Impact Index Per Article: 14.0] [Received: 12/12/2020] [Accepted: 02/22/2021] [Indexed: 12/12/2022] Open
Abstract
The adaptive immune system in vertebrates has evolved to recognize non-self antigens, such as proteins expressed by infectious agents and mutated cancer cells. T cells play an important role in antigen recognition by expressing a diverse repertoire of antigen-specific receptors, which bind epitopes to mount targeted immune responses. Recent advances in high-throughput sequencing have enabled the routine generation of T-cell receptor (TCR) repertoire data. Identifying the specific epitopes targeted by different TCRs in these data would be valuable. To accomplish that, we took advantage of the ever-increasing number of TCRs with known epitope specificity curated in the Immune Epitope Database (IEDB) since 2004. We compared seven metrics of sequence similarity to determine their power to predict if two TCRs have the same epitope specificity. We found that a comprehensive k-mer matching approach produced the best results, which we have implemented into TCRMatch, an openly accessible tool (http://tools.iedb.org/tcrmatch/) that takes TCR β-chain CDR3 sequences as an input, identifies TCRs with a match in the IEDB, and reports the specificity of each match. We anticipate that this tool will provide new insights into T cell responses captured in receptor repertoire and single cell sequencing experiments and will facilitate the development of new strategies for monitoring and treatment of infectious, allergic, and autoimmune diseases, as well as cancer.
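The abstract reports that a comprehensive k-mer matching approach best predicted shared epitope specificity, without giving the scoring formula. The sketch below therefore swaps in a deliberately simple stand-in, the Jaccard index over the sets of k-mers of two CDR3 sequences, and should not be read as TCRMatch's actual metric. The function names, the choice k = 3, and the example sequences are assumptions made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(seq_a, seq_b, k=3):
    """Jaccard index of the k-mer sets of two sequences (0.0 if both are empty)."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical CDR3-like strings that differ only in the last residue.
query = "CASSLGQ"
known = "CASSLGT"

print(kmer_jaccard(query, known))  # 4 shared 3-mers out of 6 distinct ones
print(kmer_jaccard(query, query))  # identical sequences score 1.0
```

In a lookup tool of the kind described, such a score would be computed between an input CDR3 and each curated receptor, and matches above some threshold would be reported with their known specificity.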
|
16
|
Identification of Allelic Variation in Drought Responsive Dehydrin Gene Based on Sequence Similarity in Chickpea (Cicer arietinum L.). Front Genet 2021; 11:584527. [PMID: 33381148 PMCID: PMC7767992 DOI: 10.3389/fgene.2020.584527] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 07/17/2020] [Accepted: 11/18/2020] [Indexed: 11/13/2022] Open
Abstract
Chickpea (Cicer arietinum L.) is an economically important food legume grown in arid and semi-arid regions of the world. Chickpea is cultivated mainly under rainfed, residual-moisture, and restricted-irrigation conditions. The crop is prone to drought stress, which results in flower drop and unfilled pods and is a major yield reducer in many parts of the world. The present study elucidates the association between a candidate gene and morpho-physiological traits for the screening of drought tolerance in chickpea. The abiotic stress-responsive gene Dehydrin (DHN), which plays a major role in drought tolerance, was identified in some of the chickpea genotypes using a sequence similarity approach. Analysis of variance revealed a significant effect of drought on relative water content, membrane stability index, plant height, and yield traits. The genotypes Pusa1103, Pusa362, and ICC4958 were found to be the most promising genotypes for drought tolerance, as they maintained higher values of osmotic regulation and yield characters. The results were further supported by a sequence similarity approach for the dehydrin gene when analyzed for the presence of single nucleotide polymorphisms (SNPs) and indels. Homozygous indels and SNPs were found after sequencing in some of the selected genotypes.
|
17
|
Pairwise sequence comparison data of the DNA barcodes of aquatic insects. Data Brief 2020; 32:106284. [PMID: 32995390 PMCID: PMC7502337 DOI: 10.1016/j.dib.2020.106284] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/28/2020] [Accepted: 08/31/2020] [Indexed: 11/26/2022] Open
Abstract
This study compared the DNA sequences of cytochrome c oxidase subunit I (COI) and histone H3 of Ephemeroptera, Odonata, Plecoptera, and Trichoptera in a pairwise manner, and calculated the sequence similarities (number of identical sites in both sequences per total number of sites compared) based on the uncorrected p-distance. Datasets of annotated sequences whose source organisms are identified at the species level were retrieved from INSD (GenBank/ENA/DDBJ) as of the end of May 2020. Similarity scores of the pairwise comparisons were sorted by the combinations of taxonomic groups: intraspecific variations, intrageneric-interspecific divergences, intrafamily-intergeneric divergences, and intraorder-interfamily divergences for Ephemeroptera, Odonata, Plecoptera, and Trichoptera. Similarity scores at the cumulative relative frequency points (1%, 5%, 10%, and median) may be used as thresholds to differentiate between the taxonomic groups based on sequence match. This is often done in the characterization of morphologically unidentified specimens using barcode sequences, in the metabarcoding analysis of local fauna, and in environmental DNA analysis.
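The similarity measure in this abstract can be made concrete with a short calculation. In its usual definition, the uncorrected p-distance is the proportion of compared sites at which two aligned sequences differ, and the similarity is its complement, the proportion of identical sites. The sketch below assumes the two sequences are already aligned to equal length and that sites with a gap in either sequence are skipped; the function names, the gap handling, and the example sequences are illustrative choices, not details taken from the dataset.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Uncorrected p-distance: differing sites / compared sites.

    Both sequences must be pre-aligned to the same length. Sites where either
    sequence has a gap are skipped (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


def similarity(seq_a, seq_b, gap="-"):
    """Proportion of identical sites among the compared sites."""
    return 1.0 - p_distance(seq_a, seq_b, gap)


# Made-up aligned fragments: 2 of 8 sites differ.
print(p_distance("ACGTACGT", "ACGTTCGA"))  # 0.25
print(similarity("ACGTACGT", "ACGTTCGA"))  # 0.75
```

Sorting such pairwise similarity scores by taxonomic relationship and reading off low percentiles is how thresholds of the kind mentioned in the abstract would be derived.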
|
18
|
Challenges in gene-oriented approaches for pangenome content discovery. Brief Bioinform 2020; 22:5901976. [PMID: 32893299 DOI: 10.1093/bib/bbaa198] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/28/2020] [Revised: 05/14/2020] [Accepted: 08/04/2020] [Indexed: 01/17/2023] Open
Abstract
Given a group of genomes, represented as the sets of genes that belong to them, the discovery of the pangenomic content is based on the search of genetic homology among the genes for clustering them into families. Thus, pangenomic analyses investigate the membership of the families to the given genomes. This approach is referred to as the gene-oriented approach, in contrast to other definitions of the problem that take into account different genomic features. In the past years, several tools have been developed to discover and analyse pangenomic contents. Because of the hardness of the problem, each tool applies a different strategy for discovering the pangenomic content. As a result, the performance of each tool depends on the composition of the input genomes. This review reports the main analysis instruments provided by the current state-of-the-art tools for the discovery of pangenomic contents. Moreover, unlike previous works, the presented study compares pangenomic tools from a methodological perspective, analysing the causes that lead a given methodology to outperform other tools. The analysis is performed by taking into account different bacterial populations, which are synthetically generated by changing evolutionary parameters. The benchmarks used to compare the pangenomic tools, in addition to the computational pipeline developed for this purpose, are available at https://github.com/InfOmics/pangenes-review. Contact: V. Bonnici, R. Giugno. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
|
19
|
The use of local structural similarity of distant homologues for crystallographic model building from a molecular-replacement solution. Acta Crystallogr D Struct Biol 2020; 76:248-260. [PMID: 32133989 PMCID: PMC7057216 DOI: 10.1107/s2059798320000455] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 08/09/2019] [Accepted: 01/14/2020] [Indexed: 12/18/2022] Open
Abstract
The performance of automated protein model building usually decreases with resolution, mainly owing to the lower information content of the experimental data. This calls for a more elaborate use of the available structural information about macromolecules. Here, a new method is presented that uses structural homologues to improve the quality of protein models automatically constructed using ARP/wARP. The method uses local structural similarity between deposited models and the model being built, and results in longer main-chain fragments that in turn can be more reliably docked to the protein sequence. The application of the homology-based model extension method to the example of a CFA synthase at 2.7 Å resolution resulted in a more complete model with almost all of the residues correctly built and docked to the sequence. The method was also evaluated on 1493 molecular-replacement solutions at a resolution of 4.0 Å and better that were submitted to the ARP/wARP web service for model building. A significant improvement in the completeness and sequence coverage of the built models has been observed.
|
20
|
The comparative biochemistry of viruses and humans: an evolutionary path towards autoimmunity. Biol Chem 2019; 400:629-638. [PMID: 30504522 DOI: 10.1515/hsz-2018-0271] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 06/01/2018] [Accepted: 11/07/2018] [Indexed: 11/15/2022]
Abstract
Analyses of the peptide sharing between five common human viruses (Borna disease virus, influenza A virus, measles virus, mumps virus and rubella virus) and the human proteome highlight a massive viral vs. human peptide overlap that is mathematically unexpected. Evolutionarily, the data underscore a strict relationship between viruses and the origin of eukaryotic cells. Indeed, according to the viral eukaryogenesis hypothesis and in light of the endosymbiotic theory, the first eukaryotic cell (our lineage) originated as a consortium consisting of an archaeal ancestor of the eukaryotic cytoplasm, a bacterial ancestor of the mitochondria and a viral ancestor of the nucleus. From a pathologic point of view, the peptide sequence similarity between viruses and humans may provide a molecular platform for autoimmune crossreactions during immune responses following viral infections/immunizations.
|
21
|
Predicting bacterial virulence factors - evaluation of machine learning and negative data strategies. Brief Bioinform 2019; 21:1596-1608. [PMID: 32978619 DOI: 10.1093/bib/bbz076] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 01/08/2019] [Revised: 05/17/2019] [Accepted: 06/01/2019] [Indexed: 11/12/2022] Open
Abstract
Bacterial proteins dubbed virulence factors (VFs) are a highly diverse group of sequences, whose only obvious commonality is the very property of being, more or less directly, involved in virulence. It is therefore tempting to speculate whether their prediction, based on direct sequence similarity (seqsim) to known VFs, could be enhanced or even replaced by using machine-learning methods. Specifically, when trained on a large and diverse set of VFs, such methods may be able to detect putative, non-trivial characteristics shared by otherwise unrelated VF families and therefore better predict novel VFs with insignificant similarity to each individual family. We therefore first reassess the performance of dimer-based Support Vector Machines, as used in the widely used MP3 method, in light of seqsim-only and seqsim/dimer-hybrid classifiers. We then repeat the analysis with a novel, considerably more diverse data set, also addressing the important problem of negative data selection. Finally, we move on to the real-world use case of proteome-wide VF prediction, outlining different approaches to estimating specificity in this scenario. We find that direct seqsim is of unparalleled importance and therefore should always be exploited. Further, we observe strikingly low correlations between different feature and classifier types when ranking proteins by VF likeness. We therefore propose a 'best of each world' approach to prioritize proteins for experimental testing, focussing on the top predictions of each classifier. Further, classifiers for individual VF families should be developed.
|
22
|
Target sequence requirements of a type III-B CRISPR-Cas immune system. J Biol Chem 2019; 294:10290-10299. [PMID: 31110048 DOI: 10.1074/jbc.ra119.008728] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 04/02/2019] [Revised: 05/07/2019] [Indexed: 12/26/2022] Open
Abstract
CRISPR-Cas systems are RNA-based immune systems that protect many prokaryotes from invasion by viruses and plasmids. Type III CRISPR systems are unique, as their targeting mechanism requires target transcription. Upon transcript binding, DNA cleavage by type III effector complexes is activated. Type III systems must differentiate between invader and native transcripts to prevent autoimmunity. Transcript origin is dictated by the sequence that flanks the 3' end of the RNA target site (called the PFS). However, how the PFS is recognized may vary among different type III systems. Here, using purified proteins and in vitro assays, we define how the type III-B effector from the hyperthermophilic bacterium Thermotoga maritima discriminates between native and invader transcripts. We show that native transcripts are recognized by base pairing at positions -2 to -5 of the PFS and by a guanine at position -1, which is not recognized by base pairing. We also show that mismatches with the RNA target are highly tolerated in this system, except for those nucleotides adjacent to the PFS. These findings define the target requirement for the type III-B system from T. maritima and provide a framework for understanding the target requirements of type III systems as a whole.
|
23
|
Enzyme annotation for orphan and novel reactions using knowledge of substrate reactive sites. Proc Natl Acad Sci U S A 2019; 116:7298-7307. [PMID: 30910961 PMCID: PMC6462048 DOI: 10.1073/pnas.1818877116] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Indexed: 11/18/2022] Open
Abstract
Recent advances in synthetic biochemistry have resulted in a wealth of novel hypothetical enzymatic reactions that are not matched to protein-encoding genes, deeming them “orphan.” A large number of known metabolic enzymes are also orphan, leaving important gaps in metabolic network maps. Proposing genes for the catalysis of orphan reactions is critical for applications ranging from biotechnology to medicine. In this work, the computational method BridgIT identified potential enzymes of orphan reactions and nearly all theoretically possible biochemical transformations, providing candidate genes to catalyze these reactions to the research community. The BridgIT online tool will allow researchers to fill the knowledge gaps in metabolic networks and will act as a starting point for designing novel enzymes to catalyze nonnatural transformations. Thousands of biochemical reactions with characterized activities are “orphan,” meaning they cannot be assigned to a specific enzyme, leaving gaps in metabolic pathways. Novel reactions predicted by pathway-generation tools also lack associated sequences, limiting protein engineering applications. Associating orphan and novel reactions with known biochemistry and suggesting enzymes to catalyze them is a daunting problem. We propose the method BridgIT to identify candidate genes and catalyzing proteins for these reactions. This method introduces information about the enzyme binding pocket into reaction-similarity comparisons. BridgIT assesses the similarity of two reactions, one orphan and one well-characterized nonorphan reaction, using their substrate reactive sites, their surrounding structures, and the structures of the generated products to suggest enzymes that catalyze the most-similar nonorphan reactions as candidates for also catalyzing the orphan ones. We performed two large-scale validation studies to test BridgIT predictions against experimental biochemical evidence. 
For the 234 orphan reactions from the Kyoto Encyclopedia of Genes and Genomes (KEGG) 2011 (a comprehensive enzymatic-reaction database) that became nonorphan in KEGG 2018, BridgIT predicted the exact or a highly related enzyme for 211 of them. Moreover, for 334 of 379 novel reactions in 2014 that were later cataloged in KEGG 2018, BridgIT predicted the exact or highly similar enzymes. BridgIT requires knowledge about only four connecting bonds around the atoms of the reactive sites to correctly annotate proteins for 93% of analyzed enzymatic reactions. Increasing to seven connecting bonds allowed for the accurate identification of a sequence for nearly all known enzymatic reactions.
|
24
|
A Content-Based Retrieval Framework for Whole Metagenome Sequencing Samples. J Integr Bioinform 2018; 15:/j/jib.ahead-of-print/jib-2017-0067/jib-2017-0067.xml. [PMID: 30367805 PMCID: PMC6348744 DOI: 10.1515/jib-2017-0067] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/31/2017] [Accepted: 04/11/2018] [Indexed: 11/15/2022] Open
Abstract
Finding similarities and differences between metagenomic samples within large repositories has been a significant issue for researchers. In recent years, content-based retrieval has been suggested by various studies from different perspectives. In this study, a content-based retrieval framework for identifying relevant metagenomic samples is developed. The framework consists of feature extraction, selection methods and similarity measures for whole metagenome sequencing samples. Performance of the developed framework was evaluated on given samples. A ground truth was used to evaluate system performance: retrieved samples from patients with the same disease (positive samples) are labeled as relevant, and all others as irrelevant. The experimental results show that relevant experiments can be detected by using different fingerprinting approaches. We observed that the Latent Semantic Analysis (LSA) method is a promising fingerprinting approach for representing metagenomic samples and finding relevance among them. Source codes and executable files are available at www.baskent.edu.tr/~hogul/WMS_retrieval.rar.
|
25
|
Systematic Identification and Classification of β-Lactamases Based on Sequence Similarity Criteria: β-Lactamase Annotation. Evol Bioinform Online 2018; 14:1176934318797351. [PMID: 30210232 PMCID: PMC6131288 DOI: 10.1177/1176934318797351] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 03/15/2018] [Accepted: 08/08/2018] [Indexed: 12/11/2022] Open
Abstract
β-lactamases, the enzymes responsible for resistance to β-lactam antibiotics, are widespread among prokaryotic genera. However, current β-lactamase classification schemes do not represent their present diversity. Here, we propose a workflow to identify and classify β-lactamases. Initially, a set of curated sequences was used as a model for the construction of profile hidden Markov models (HMMs) specific for each β-lactamase class. An extensive, nonredundant set of β-lactamase sequences was constructed from 7 different resistance protein databases to test the methodology. The profile HMMs were improved for their specificity and sensitivity and then applied to fully assembled genomes. Five hierarchical classification levels are described, and a new class of β-lactamases with fused domains is proposed. Our profile HMMs provide a better annotation of β-lactamases, with classes and subclasses defined by objective criteria such as sequence similarity. This classification offers a solid base for studies on the diversity, dispersion, prevalence, and evolution of the different classes and subclasses of this critical enzymatic activity.
|
26
|
A Novel Strategy for Detecting Recent Horizontal Gene Transfer and Its Application to Rhizobium Strains. Front Microbiol 2018; 9:973. [PMID: 29867876 PMCID: PMC5968381 DOI: 10.3389/fmicb.2018.00973] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 02/05/2018] [Accepted: 04/25/2018] [Indexed: 11/13/2022] Open
Abstract
Recent horizontal gene transfer (HGT) is crucial for enabling microbes to rapidly adapt to their novel environments without relying upon rare beneficial mutations that arise spontaneously. For several years now, computational approaches have been developed to detect HGT, but they typically lack the sensitivity and ability to detect recent HGT events. Here we introduce a novel strategy, named RecentHGT. The number of genes undergoing recent HGT between two bacterial genomes was estimated by a new algorithm derived from the expectation-maximization algorithm and is based on the theoretical sequence-similarity distribution of orthologous genes. We tested the proposed strategy by applying it to a set of 10 Rhizobium genomes, and detected several large-scale recent HGT events. We also found that our strategy was more sensitive than other available HGT detection methods. These HGT events were mainly mediated by symbiotic plasmids. Our new strategy can provide clear evidence of recent HGT events and thus it brings us closer to the goal of detecting these potentially adaptive evolution processes in rhizobia as well as pathogens.
|
27
|
Careful use of 16S rRNA gene sequence similarity values for the identification of Mycobacterium species. New Microbes New Infect 2017; 22:24-29. [PMID: 29556405 PMCID: PMC5857167 DOI: 10.1016/j.nmni.2017.12.009] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Received: 10/20/2017] [Accepted: 12/20/2017] [Indexed: 11/03/2022] Open
Abstract
In order to evaluate the suitability of 16S rRNA nucleotide sequence similarity for the classification of new Mycobacterium isolates at the species level, we systematically studied the pairwise identity values of this gene for 131 Mycobacterium species with standing in nomenclature. Only one of the studied species, M. poriferae (0.76%), strictly respected the 95% and 98.65% threshold values currently recommended to determine the affiliation of bacterial isolates to an existing or new genus or species, respectively. All other species exhibited at least one identity value >98.65% and/or <95% with another Mycobacterium species. Therefore, we suggest that interspecies 16S rRNA identity values should be interpreted cautiously when classifying a new mycobacterial isolate at the species level.
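The threshold rule that the abstract evaluates can be written as a small decision function. This is only a sketch of the generic 95% / 98.65% rule; the function name, the returned labels, and the treatment of values exactly at a threshold are assumptions made for illustration.

```python
GENUS_THRESHOLD = 95.0     # percent identity, genus-level cutoff from the abstract
SPECIES_THRESHOLD = 98.65  # percent identity, species-level cutoff from the abstract


def classify_by_16s_identity(identity_percent: float) -> str:
    """Apply the generic 16S rRNA identity thresholds to one pairwise value."""
    if not 0.0 <= identity_percent <= 100.0:
        raise ValueError("identity must be a percentage between 0 and 100")
    if identity_percent < GENUS_THRESHOLD:
        return "different genus"
    if identity_percent > SPECIES_THRESHOLD:
        return "same species"
    # assumption: values from 95% up to and including 98.65% fall in this band
    return "same genus, different species"


for value in (94.9, 97.0, 99.2):
    print(value, classify_by_16s_identity(value))
```

The abstract's finding is that nearly every Mycobacterium species has at least one pairwise value that such a rule would place in the wrong band, which is why it recommends caution.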
|
28
|
Heterogeneous molecular processes among the causes of how sequence similarity scores can fail to recapitulate phylogeny. Brief Bioinform 2017; 18:451-457. [PMID: 27103098 PMCID: PMC5429007 DOI: 10.1093/bib/bbw034] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 01/19/2016] [Indexed: 11/24/2022] Open
Abstract
Sequence similarity tools like Basic Local Alignment Search Tool (BLAST) are essential components of many functional genetic, genomic, phylogenetic and bioinformatic studies. Many modern analysis pipelines use significant sequence similarity scores (p- or E-values) and the ranked order of BLAST matches to test a wide range of hypotheses concerning homology, orthology, the timing of de novo gene birth/death and gene family expansion/contraction. Despite significant contrary findings, many of these tests still implicitly assume that stronger or higher-ranked E-value scores imply closer phylogenetic relationships between sequences. Here, we demonstrate that even though a general relationship does exist between the phylogenetic distance of two sequences and their E-value, significant and misleading errors occur in both the completeness and the order of results under realistic evolutionary scenarios. These results provide additional details to past evidence showing that studies should avoid drawing direct inferences of evolutionary relatedness from measures of sequence similarity alone, and should instead, where possible, use more rigorous phylogeny-based methods.
|
29
|
Abstract
Under aggregation-prone conditions, soluble amyloidogenic protein monomers can self-assemble into fibrils or they can fibrillize on preformed fibrillar seeds (seeded aggregation). Seeded aggregations are known to propagate the morphology of the seeds in the event of cross-seeding. However, not all proteins are known to cross-seed aggregation. Cross-seeding has been proposed to be restricted either because of differences in the protein sequences or because of conformations between the seeds and the soluble monomers. Here, we examine cross-seeding efficiency between three α-synuclein sequences, wild-type, A30P, and A53T, each varying in only one or two amino acids but forming morphologically distinct fibrils. Results from bulk Thioflavin-T measurements, monomer incorporation quantification, single fibril fluorescence microscopy, and atomic force microscopy show that under the given solution conditions conformity between the conformation of seeds and monomers is essential for seed elongation. Moreover, elongation characteristics of the seeds are defined by the type of seed.
|
30
|
Abstract
Research in the recent decade has demonstrated the usefulness of protein network knowledge in furthering the study of molecular evolution of proteins, understanding the robustness of cells to perturbation, and annotating new protein functions. In this study, we aimed to provide a general clustering approach to visualize the sequence-structure-function relationship of protein networks, and investigate possible causes for inconsistency in the protein classifications based on sequences, structures, and functions. Such visualization of protein networks could facilitate our understanding of the overall relationship among proteins and help researchers comprehend various protein databases. As a demonstration, we clustered 1437 enzymes by their sequences and structures using the minimum span clustering (MSC) method. The general structure of this protein network was delineated at two clustering resolutions, and the second level MSC clustering was found to be highly similar to existing enzyme classifications. The clustering of these enzymes based on sequence, structure, and function information is consistent with each other. For proteases, the Jaccard's similarity coefficient is 0.86 between sequence and function classifications, 0.82 between sequence and structure classifications, and 0.78 between structure and function classifications. From our clustering results, we discussed possible examples of divergent evolution and convergent evolution of enzymes. Our clustering approach provides a panoramic view of the sequence-structure-function network of proteins, helps visualize the relation between related proteins intuitively, and is useful in predicting the structure and function of newly determined protein sequences.
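The abstract does not say how the Jaccard coefficient between two classifications was computed. One common way, sketched here purely as an assumption, is pair counting: treat each classification as the set of item pairs it places in the same group, then take intersection over union. All names and the toy labels below are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(labels: dict) -> set:
    """Set of item pairs that a classification assigns to the same group."""
    pairs = set()
    for a, b in combinations(sorted(labels), 2):
        if labels[a] == labels[b]:
            pairs.add((a, b))
    return pairs


def jaccard_between_classifications(labels_x: dict, labels_y: dict) -> float:
    """Pair-counting Jaccard coefficient between two classifications of the same items."""
    if labels_x.keys() != labels_y.keys():
        raise ValueError("both classifications must cover the same items")
    pairs_x = co_clustered_pairs(labels_x)
    pairs_y = co_clustered_pairs(labels_y)
    union = pairs_x | pairs_y
    if not union:
        return 1.0  # assumption: two all-singleton classifications agree fully
    return len(pairs_x & pairs_y) / len(union)


# Toy example: four enzymes grouped by function and by sequence cluster.
by_function = {"e1": "A", "e2": "A", "e3": "B", "e4": "B"}
by_sequence = {"e1": "s1", "e2": "s1", "e3": "s1", "e4": "s2"}
print(jaccard_between_classifications(by_function, by_sequence))
```

A value near 1 means the two classifications group almost the same pairs together, which is the sense in which the reported 0.78 to 0.86 values indicate agreement.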
|
31
|
Complete mitochondrial genome of Ctenopharyngodon idella var. Gold grass carp and its intraspecific comparison. Mitochondrial DNA A DNA Mapp Seq Anal 2016; 28:372-374. [PMID: 27211306 DOI: 10.3109/19401736.2015.1126828] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022]
Abstract
We determined the complete mitochondrial genome (mitogenome) of Ctenopharyngodon idella var. gold grass carp (C. idella var.) (Accession No.: KT894100) and compared it with a previously published mitogenome (Accession No.: NC010288.1) of Chinese common grass carp. The total length of the mitogenome is 16,609 bp, with 13 protein-coding genes, 2 rRNA genes, 22 tRNA genes and a control region. The nucleotide composition of the L-strand is 31.87% A, 26.20% T, 15.68% G and 26.26% C. Most of the protein-coding genes begin with an ATG start codon, except for the COX1 and ND3 genes. The sequence similarity between the two mitogenomes was 99.81%. There were 32 variation loci, including 23 transitions, 7 transversions and 2 indels.
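The reported similarity is consistent with the reported counts if one assumes that each of the 32 variation loci counts as a single differing site and that the comparison length equals the 16,609 bp genome length. Both assumptions are made here for illustration; the abstract does not state how the figure was derived.

```python
def percent_similarity(compared_sites: int, differing_sites: int) -> float:
    """Percent identity given the number of compared and differing sites."""
    if compared_sites <= 0 or not 0 <= differing_sites <= compared_sites:
        raise ValueError("invalid site counts")
    return 100.0 * (1.0 - differing_sites / compared_sites)


variation_loci = 23 + 7 + 2  # transitions + transversions + indels
print(variation_loci)
print(round(percent_similarity(16609, variation_loci), 2))
```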
|
32
|
Abstract
Alignment-free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogenetic reconstruction. Among all the methods based on substring composition, the average common substring (ACS) measure admits a straightforward linear time sequence comparison algorithm, while yielding impressive results in multiple applications. An important direction of this research is to extend the approach to permit a bounded edit/Hamming distance between substrings, so as to reflect more accurately the evolutionary process. To date, however, algorithms designed to incorporate k ≥ 1 mismatches have $O(n^2)$ worst-case time complexity, where n is the total length of the input sequences. On the other hand, accounting for mismatches has been shown to lead to much improved classification, while heuristics can improve practical performance. In this article, we close the gap by presenting the first provably efficient algorithm for the k-mismatch average common substring (ACSk) problem that takes $O(n)$ space and $O(n \log^k n)$ time in the worst case for any constant k. Our method extends the generalized suffix tree model to incorporate a carefully selected bounded set of perturbed suffixes, and can be applied to other complex approximate sequence matching problems.
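To make the quantity concrete, here is a deliberately naive sketch of the k-mismatch average common substring value: for each start position in one sequence, find the longest stretch that matches some stretch of the other sequence with at most k mismatches, then average those lengths. This brute-force version is not the suffix-tree algorithm the abstract describes and is far slower; the function names are hypothetical.

```python
def longest_match_with_mismatches(x: str, i: int, y: str, j: int, k: int) -> int:
    """Longest common prefix of x[i:] and y[j:] allowing at most k mismatches."""
    length = 0
    mismatches = 0
    while i + length < len(x) and j + length < len(y):
        if x[i + length] != y[j + length]:
            if mismatches == k:
                break  # one more mismatch would exceed the budget
            mismatches += 1
        length += 1
    return length


def acs_k(x: str, y: str, k: int = 0) -> float:
    """Average, over positions i of x, of the longest k-mismatch match found in y."""
    if not x or not y:
        raise ValueError("both sequences must be non-empty")
    total = 0
    for i in range(len(x)):
        total += max(
            longest_match_with_mismatches(x, i, y, j, k) for j in range(len(y))
        )
    return total / len(x)


print(acs_k("ACGT", "ACGA", k=0))  # exact matches only
print(acs_k("ACGT", "ACGA", k=1))  # one mismatch allowed per match
```

With k = 0 this reduces to the plain ACS value; turning it into a distance requires a further normalization step that is not shown here.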
|
33
|
Optimizing high performance computing workflow for protein functional annotation. Concurrency and Computation: Practice & Experience 2014; 26:2112-2121. [PMID: 25313296 PMCID: PMC4194055 DOI: 10.1002/cpe.3264] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/04/2023]
Abstract
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST), the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
|
34
|
Intraspecific comparison of complete mitogenome sequences from two Asian raccoon dogs (Canidae: Nyctereutes procyonoides). Mitochondrial DNA 2014; 26:827-828. [PMID: 24409844 DOI: 10.3109/19401736.2013.855913] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/03/2023]
Abstract
We determined the complete mitochondrial genome (GenBank accession number: KF709435) of the Korean raccoon dog Nyctereutes procyonoides koreensis and compared it with a previously published mitogenome (GenBank accession number: GU256221) of a Chinese raccoon dog. The total length of the N. p. koreensis mitogenome is 16,802 bp, with a base composition of 32.1% A, 26.9% T, 26.8% C and 14.2% G. A high similarity of 98.7% was found between the complete mitogenome sequences of Korean and Chinese raccoon dogs. Excluding the D-loop, the sequence similarity of the two mitogenomes was 99.3%. A sequence similarity of 99.1% was found in the 13 protein-coding gene regions, whereas the mtDNA regions covering all 22 tRNA genes were 99.6% identical. There was no variation between the 12S rRNAs, whereas a 0.5% difference was found between the 16S rRNAs.
|
35
|
mrtailor: a tool for PDB-file preparation for the generation of external restraints. Acta Crystallographica Section D: Biological Crystallography 2013; 69:1861-3. [PMID: 23999309 DOI: 10.1107/s090744491301648x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 01/20/2013] [Accepted: 06/13/2013] [Indexed: 11/11/2022]
Abstract
Model building starting from, for example, a molecular-replacement solution with low sequence similarity introduces model bias, which can be difficult to detect, especially at low resolution. The program mrtailor removes low-similarity regions from a template PDB file according to sequence similarity between the target sequence and the template sequence and maps the target sequence onto the PDB file. The modified PDB file can be used to generate external restraints for low-resolution refinement with reduced model bias and can be used as a starting point for model building and refinement. The program can call ProSMART [Nicholls et al. (2012), Acta Cryst. D68, 404-417] directly in order to create external restraints suitable for REFMAC5 [Murshudov et al. (2011), Acta Cryst. D67, 355-367]. Both a command-line version and a GUI exist.
|
36
|
An introduction to sequence similarity ("homology") searching. Current Protocols in Bioinformatics 2013; Chapter 3:3.1.1-3.1.8. [PMID: 23749753 PMCID: PMC3820096 DOI: 10.1002/0471250953.bi0301s42] [Citation(s) in RCA: 407] [Impact Index Per Article: 37.0] [Indexed: 01/05/2023]
Abstract
Sequence similarity searching, typically with BLAST, is the most widely used and most reliable strategy for characterizing newly determined sequences. Sequence similarity searches can identify "homologous" proteins or genes by detecting excess similarity, that is, statistically significant similarity that reflects common ancestry. This unit provides an overview of the inference of homology from significant similarity, and introduces other units in this chapter that provide more details on effective strategies for identifying homologs.
|
37
|
Using BLAT to find sequence similarity in closely related genomes. Current Protocols in Bioinformatics 2012; Chapter 10:10.8.1-10.8.24. [PMID: 22389010 PMCID: PMC4101998 DOI: 10.1002/0471250953.bi1008s37] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Indexed: 11/10/2022]
Abstract
The BLAST-Like Alignment Tool (BLAT) is used to find genomic sequences that match a protein or DNA sequence submitted by the user. BLAT is typically used for searching similar sequences within the same or closely related species. It was developed to align millions of expressed sequence tags and mouse whole-genome random reads to the human genome at a higher speed. It is freely available either on the Web or as a downloadable stand-alone program. BLAT search results provide a link for visualization in the University of California, Santa Cruz (UCSC) Genome Browser, where associated biological information may be obtained. Three example protocols are given: using an mRNA sequence to identify the exon-intron locations and associated gene in the genomic sequence of the same species, using a protein sequence to identify the coding regions in a genomic sequence and to search for gene family members in the same species, and using a protein sequence to find homologs in another species.
|
38
|
Abstract
Thirty viral proteomes were examined for amino acid sequence similarity to the human proteome, and, in parallel, a control of 30 sets of human proteins was analyzed for internal human overlapping. We find that all of the analyzed 30 viral proteomes, independently of their structural or pathogenic characteristics, present a high number of pentapeptide overlaps to the human proteome. Among the examined viruses, human T-lymphotropic virus 1, Rubella virus, and hepatitis C virus present the highest number of viral overlaps to the human proteome. The widespread and ample distribution of viral amino acid sequences through the human proteome indicates that viral and human proteins are formed of common peptide backbone units and suggests a fluid compositional chimerism in phylogenetic entities canonically classified distantly as viruses and Homo sapiens. Importantly, the massive viral to human peptide overlapping calls into question the possibility of a direct causal association between virus-host sharing of amino acid sequences and incitement to autoimmune reactions through molecular recognition of common motifs.
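The pentapeptide overlap analysis described in this abstract can be illustrated with a minimal sketch. It assumes proteomes are given as plain lists of amino acid strings and counts distinct shared 5-mers; the abstract does not say how occurrences were counted, so the function names and the choice of distinct (rather than per-occurrence) counting are assumptions made for illustration only.

```python
def kmers(seq, k=5):
    """Return the set of all length-k substrings of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_pentapeptides(viral_proteins, human_proteins, k=5):
    """Return the distinct k-mers present in both sets of proteins.

    Hypothetical helper: counts each shared peptide once, which may differ
    from the counting scheme used in the study.
    """
    human_index = set()
    for seq in human_proteins:
        human_index |= kmers(seq, k)

    viral_index = set()
    for seq in viral_proteins:
        viral_index |= kmers(seq, k)

    return viral_index & human_index


# Toy sequences, not real proteins.
overlaps = shared_pentapeptides(["MKTAYIAK"], ["QQKTAYIQQ"])
print(sorted(overlaps))
```

Indexing the larger (human) proteome as a set keeps each lookup constant time, so the cost grows roughly linearly with total sequence length.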
|
39
|
Prediction of protein-protein interactions on the basis of evolutionary conservation of protein functions. Evol Bioinform Online 2007; 3:197-206. [PMID: 19461979 PMCID: PMC2684133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022] Open
Abstract
MOTIVATION: Although a great deal of progress is being made in the development of fast and reliable experimental techniques to extract genome-wide networks of protein-protein and protein-DNA interactions, the sequencing of new genomes proceeds at an even faster rate. There is therefore a considerable need for reliable methods of in-silico prediction of protein interactions based solely on sequence similarity information and known interactions from well-studied organisms. This problem can be solved if a dependency exists between sequence similarity and the conservation of the proteins' functions. RESULTS: In this paper, we introduce a novel probabilistic method for prediction of protein-protein interactions using a new empirical probabilistic formula describing the loss of interactions between homologous proteins during the course of evolution. This formula describes an evolutionary process quite similar to the process of the Earth's population growth. In addition, our method favors predictions confirmed by several interacting pairs over predictions coming from a single interacting pair. Our approach is useful in working with "noisy" data such as those coming from high-throughput experiments. We have generated predictions for five "model" organisms: H. sapiens, D. melanogaster, C. elegans, A. thaliana, and S. cerevisiae and evaluated the quality of these predictions.
|
40
|
Prediction of interacting proteins from homology-modeled complex structures using sequence and structure scores. Biophysics (Nagoya-shi) 2007; 3:13-26. [PMID: 27857563 PMCID: PMC5036659 DOI: 10.2142/biophysics.3.13] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 02/15/2007] [Accepted: 05/31/2007] [Indexed: 12/01/2022] Open
Abstract
Protein-protein interactions support most biological processes, and it is important to find specifically interacting partner proteins among homologous proteins in order to elucidate cellular functions such as signal transduction systems. Various high-throughput experimental methods for identifying these interactions have been invented and used to generate a huge amount of data. Because these experiments have been applied to only a few organisms, and their accuracy is believed to be limited, it would be valuable to develop computational methods for predicting protein-protein interactions from their amino acid sequences or tertiary structural information. In this study, we describe a method for predicting interacting proteins based on homology-modeled complex structures. We employed the statistical residue-residue contact energy used in a previous study, and two types of new scores, simple electrostatic energy and sequence similarity between target sequences and template structures. The validity of each protein-protein complex model was measured using their single and combined scores. We applied our method to all the protein heterodimers of Saccharomyces cerevisiae. To evaluate the prediction performance of our method, we prepared two types of protein-protein interaction datasets: a complete dataset and a high confidence dataset. The complete dataset (10,325 protein dimer models) contains all the yeast protein heterodimers whose complex structures can be modeled. Among them, pairs registered in the DIP database are defined as interacting pairs, and those not registered are defined as non-interacting protein pairs. The high confidence dataset (3,219 protein dimer models) is a more reliable subset of the complete dataset extracted using the criterion of common subcellular localization.
Both datasets show that sequence similarity has a much higher discrimination power than the other structure-based scores, but that the inclusion of contact energy results in significant improvement over predictions using sequence similarity alone. These results suggest that the sequence similarity is indispensable for the prediction, whereas structure scores can play supporting roles.
|
41
|
Abstract
Fold recognition predicts protein three-dimensional structure by establishing relationships between a protein sequence and known protein structures. Most methods explicitly use information derived from the secondary and tertiary structure of the templates. Here we show that rigorous application of a sequence search method (PSI-BLAST) with no reference to secondary or tertiary structure information is able to perform as well as traditional fold recognition methods. Since the method, SENSER, does not require knowledge of the three-dimensional structure, it can be used to infer relationships that are not tractable by methods dependent on structural templates.
|
42
|
Abstract
Previously, we demonstrated that the DE072 strain of IBV is a recombinant that has an IBV strain D1466-like sequence in the S gene. Herein, we analyzed the remaining 3.8 kb 3' end of the genome, which includes Gene 3, Gene 4, Gene 5, Gene 6, and the 3' non-coding region of the DE072 and D1466 strains. Those two viruses had high nucleotide similarity in Gene 4. However, the other individual genes had markedly different levels of sequence similarity with the same gene of the other IBV strains. The genomes of five IBV strains, for which the complete sequence of the 3' end of the genome has been determined, were divided at an intergenic (IG) consensus sequence (CTGAACAA or CTTAACAA) and compared phylogenetically. Phylogenetic trees of different topology indicated that the consensus IG sequences and the highly conserved sequence around these regions may serve as recombination 'hot spots'. Phylogenetic analysis of selected regions of the genome of the DE072 serotype field isolates further supports those results and indicates that isolates within the same serotype may have different amounts of nucleotide sequence similarity with each other in individual genes other than the S gene. Presumably this occurs because the consensus IG sequence serves as the template switching site for the virally encoded polymerase.
|
43
|
Probable reassortment of genomic elements among elongated RNA-containing plant viruses. J Mol Evol 1989; 29:52-62. [PMID: 2504930 PMCID: PMC7087513 DOI: 10.1007/bf02106181] [Citation(s) in RCA: 81] [Impact Index Per Article: 2.3] [Received: 07/08/1988] [Revised: 10/26/1988] [Indexed: 01/01/2023]
Abstract
The relationships of genome organization among elongated (rod-shaped and filamentous) plant viruses have been analyzed. Sequences in coding and noncoding regions of barley stripe mosaic virus (BSMV) RNAs 1, 2, and 3 were compared with those of the monopartite RNA genomes of potato virus X (PVX), white clover mosaic virus (WClMV), and tobacco mosaic virus, the bipartite genome of tobacco rattle virus (TRV), the quadripartite genome of beet necrotic yellow vein virus (BNYVV), and icosahedral tricornaviruses. These plant viruses belong to a supergroup having 5'-capped genomic RNAs. The results suggest that the genomic elements in each BSMV RNA are phylogenetically related to those of different plant RNA viruses. RNA 1 resembles the corresponding RNA 1 of tricornaviruses. The putative proteins encoded in BSMV RNA 2 are related to the products of BNYVV RNA 2, PVX RNA, and WClMV RNA. Amino acid sequence comparisons suggest that BSMV RNA 3 resembles TRV RNA 1. Also, it can be proposed that in the case of monopartite genomes, as a rule, every gene or block of genes retains phylogenetic relationships that are independent of adjacent genomic elements of the same RNA. Such differential evolution of individual elements of one and the same viral genome implies a prominent role for gene reassortment in the formation of viral genetic systems.
|
44
|
Abstract
Analyses of immunoglobulin (Ig) variable (V) region gene usage in the immune response, estimates of V gene germline complexity, and other nucleic acid hybridization-based studies depend on the extent to which such genes are related (i.e., sequence similarity) and their organization in gene families. While mouse Igh heavy chain V region (VH) gene families are relatively well-established, a corresponding systematic classification of Igk light chain V region (Vk) genes has not been reported. The present analysis, in the course of which we reviewed the known extent of the Vk germline gene repertoire and Vk gene usage in a variety of responses to foreign and self antigens, provides a classification of mouse Vk genes in gene families composed of members with greater than 80% overall nucleic acid sequence similarity. This classification differed in several aspects from that of VH genes: only some Vk gene families were as clearly separated (by greater than 25% sequence dissimilarity) as typical VH gene families; most Vk gene families were closely related and, in several instances, members from different families were very similar (greater than 80%) over large sequence portions; frequently, classification by nucleic acid sequence similarity diverged from existing classifications based on amino-terminal protein sequence similarity. Our data have implications for Vk gene analyses by nucleic acid hybridization and describe potentially important differences in sequence organization between VH and Vk genes.
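Grouping genes into families by a similarity threshold, as this abstract describes, can be sketched as follows. This is only an illustration under stated assumptions: sequences are taken to be pre-aligned and of equal length, similarity is measured as simple percent identity, and families are formed by single-linkage grouping with a union-find structure. The abstract does not specify the authors' procedure, and single linkage does not guarantee that every pair within a family exceeds the threshold. All function names and the toy sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percent of identical positions between two pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def group_families(seqs, threshold=80.0):
    """Single-linkage grouping: link two genes if identity > threshold."""
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the root, compressing the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if percent_identity(seqs[a], seqs[b]) > threshold:
                parent[find(a)] = find(b)

    families = {}
    for n in names:
        families.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in families.values())


# Toy sequences, not real Vk genes.
toy = {"g1": "ACGTACGTAC", "g2": "ACGTACGTAA", "g3": "TTTTTTTTTT"}
print(group_families(toy))
```

A stricter alternative, closer to requiring that all members exceed 80% similarity to each other, would be complete-linkage clustering; it is not shown here.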
|