1
Idrees S, Paudel KR, Sadaf T, Hansbro PM. Uncovering domain motif interactions using high-throughput protein-protein interaction detection methods. FEBS Lett 2024; 598:725-742. [PMID: 38439692 DOI: 10.1002/1873-3468.14841] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Revised: 01/09/2024] [Accepted: 02/18/2024] [Indexed: 03/06/2024]
Abstract
Protein-protein interactions (PPIs) are often mediated by short linear motifs (SLiMs) in one protein and a domain in another, known as domain-motif interactions (DMIs). During the past decade, SLiMs have been studied to find their role in cellular functions such as post-translational modifications, regulatory processes, protein scaffolding, cell cycle progression, cell adhesion, cell signalling and substrate selection for proteasomal degradation. This review provides a comprehensive overview of the current PPI detection techniques and resources, focusing on their relevance to capturing interactions mediated by SLiMs. We also address the challenges associated with capturing DMIs. Moreover, a case study analysing the BioGrid database as a source of DMI prediction revealed significant known DMI enrichment in different PPI detection methods. Overall, current high-throughput PPI detection methods can be a reliable source for predicting DMIs.
Affiliation(s)
- Sobia Idrees
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, Australia
- Centre for Inflammation, Centenary Institute and Faculty of Science, School of Life Sciences, University of Technology Sydney, Australia
- Keshav Raj Paudel
- Centre for Inflammation, Centenary Institute and Faculty of Science, School of Life Sciences, University of Technology Sydney, Australia
- Tayyaba Sadaf
- Centre for Inflammation, Centenary Institute and Faculty of Science, School of Life Sciences, University of Technology Sydney, Australia
- Philip M Hansbro
- Centre for Inflammation, Centenary Institute and Faculty of Science, School of Life Sciences, University of Technology Sydney, Australia
2
Juillerat-Jeanneret L, Tafelmeyer P, Golshayan D. Regulation of Fibroblast Activation Protein-α Expression: Focus on Intracellular Protein Interactions. J Med Chem 2021; 64:14028-14045. [PMID: 34523930 DOI: 10.1021/acs.jmedchem.1c01010] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/12/2022]
Abstract
The prolyl-specific peptidase fibroblast activation protein-α (FAP-α) is expressed at very low or undetectable levels in nondiseased human tissues but is selectively induced in activated (myo)fibroblasts at sites of tissue remodeling in fibrogenic processes. In normal regenerative processes involving transient fibrosis, FAP-α+ (myo)fibroblasts disappear from injured tissues, replaced by cells with a normal FAP-α- phenotype. In chronic uncontrolled pathological fibrosis, FAP-α+ (myo)fibroblasts permanently replace normal tissues. The mechanisms of regulation and elimination of FAP-α expression in (myo)fibroblasts are unknown. According to a yeast two-hybrid screen and protein databank search, we propose that the intracellular (co)-chaperone BAG6/BAT3 can interact with FAP-α, mediated by the BAG6/BAT3 Pro-rich domain, inducing proteasomal degradation of FAP-α protein under tissue homeostasis. In this Perspective, we discuss our findings in the context of current knowledge on the regulation of FAP-α expression and comment on potential therapeutic strategies for uncontrolled fibrosis, including small molecule degraders (PROTACs)-modified FAP-α targeted inhibitors.
Affiliation(s)
- Lucienne Juillerat-Jeanneret
- Transplantation Center and Transplantation Immunopathology Laboratory, Department of Medicine, Centre Hospitalier Universitaire Vaudois (CHUV) and University of Lausanne (UNIL), CH1011 Lausanne, Switzerland; University Institute of Pathology, CHUV and UNIL, CH1011 Lausanne, Switzerland
- Petra Tafelmeyer
- Hybrigenics Services, Laboratories and Headquarters-Paris, 1 rue Pierre Fontaine, 91000 Evry, France; Hybrigenics Corporation, Cambridge Innovation Center, 50 Milk Street, Cambridge, Massachusetts 02142, United States
- Dela Golshayan
- Transplantation Center and Transplantation Immunopathology Laboratory, Department of Medicine, Centre Hospitalier Universitaire Vaudois (CHUV) and University of Lausanne (UNIL), CH1011 Lausanne, Switzerland
3
Mier P, Andrade-Navarro MA. Avoided motifs: short amino acid strings missing from protein datasets. Biol Chem 2021; 402:945-951. [PMID: 33660494 DOI: 10.1515/hsz-2020-0383] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/09/2020] [Accepted: 02/19/2021] [Indexed: 11/15/2022]
Abstract
According to the amino acid composition of natural proteins, it could be expected that all possible sequences of three or four amino acids will occur at least once in large protein datasets purely by chance. However, in some species or cellular context, specific short amino acid motifs are missing due to unknown reasons. We describe these as Avoided Motifs, short amino acid combinations missing from biological sequences. Here we identify 209 human and 154 bacterial Avoided Motifs of length four amino acids, and discuss their possible functionality according to their presence in other species. Furthermore, we determine two Avoided Motifs of length three amino acids in human proteins specifically located in the cytoplasm, and two more in secreted proteins. Our results support the hypothesis that the characterization of Avoided Motifs in particular contexts can provide us with information about functional motifs, pointing to a new approach in the use of molecular sequences for the discovery of protein function.
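The idea of an avoided motif can be illustrated with a minimal sketch, which is not the authors' pipeline: it enumerates every k-mer over the 20 standard amino acids and reports those that never occur in a set of sequences. The function name `avoided_motifs` and the two toy sequences are hypothetical and chosen only for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def avoided_motifs(sequences, k=3):
    """Return, sorted, the k-mers over the standard alphabet absent from all sequences."""
    seen = set()
    for seq in sequences:
        # Slide a window of width k over the sequence and record each k-mer.
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    all_kmers = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return sorted(kmer for kmer in all_kmers if kmer not in seen)


# Toy input (hypothetical sequences, not real proteome data).
toy_proteome = ["MKTAYIAKQR", "ACDEFGHIKL"]
missing = avoided_motifs(toy_proteome, k=2)
```

On a real proteome one would use k = 3 or 4, as in the abstract, and would still need a statistical argument that an absence is not expected by chance; this sketch only performs the counting step.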
Affiliation(s)
- Pablo Mier
- Faculty of Biology, Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, Hanns-Dieter-Hüsch-Weg 15, D-55128 Mainz, Germany
- Miguel A Andrade-Navarro
- Faculty of Biology, Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, Hanns-Dieter-Hüsch-Weg 15, D-55128 Mainz, Germany
4
Mako: A Graph-based Pattern Growth Approach to Detect Complex Structural Variants. Genomics Proteomics Bioinformatics 2021; 20:205-218. [PMID: 34224879 PMCID: PMC9510932 DOI: 10.1016/j.gpb.2021.03.007] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/17/2021] [Revised: 03/05/2021] [Accepted: 03/05/2021] [Indexed: 11/21/2022]
Abstract
Complex structural variants (CSVs) are genomic alterations that have more than two breakpoints and are considered as the simultaneous occurrence of simple structural variants. However, detecting the compounded mutational signals of CSVs is challenging through a commonly used model-match strategy. As a result, there has been limited progress for CSV discovery compared with simple structural variants. Here, we systematically analyzed the multi-breakpoint connection feature of CSVs, and proposed Mako, utilizing a bottom-up guided model-free strategy, to detect CSVs from paired-end short-read sequencing. Specifically, we implemented a graph-based pattern growth approach, where the graph depicts potential breakpoint connections, and pattern growth enables CSV detection without pre-defined models. Comprehensive evaluations on both simulated and real datasets revealed that Mako outperformed other algorithms. Notably, validation rates of CSVs on real data based on experimental and computational validations as well as manual inspections are around 70%, where the medians of experimental and computational breakpoint shift are 13 bp and 26 bp, respectively. Moreover, the Mako CSV subgraph effectively characterized the breakpoint connections of a CSV event and uncovered a total of 15 CSV types, including two novel types of adjacent segment swap and tandem dispersed duplication. Further analysis of these CSVs also revealed the impact of sequence homology on the formation of CSVs. Mako is publicly available at https://github.com/xjtu-omics/Mako.
5
PVTree: A Sequential Pattern Mining Method for Alignment Independent Phylogeny Reconstruction. Genes (Basel) 2019; 10:genes10020073. [PMID: 30678245 PMCID: PMC6410268 DOI: 10.3390/genes10020073] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/27/2018] [Revised: 01/04/2019] [Accepted: 01/14/2019] [Indexed: 11/21/2022]
Abstract
A phylogenetic tree is essential for understanding evolution and is usually constructed through multiple sequence alignment, which suffers from heavy computational burdens and requires sophisticated parameter tuning. Recently, alignment-free methods based on k-mer profiles or common substrings have provided alternative ways to construct phylogenetic trees. However, most of these methods ignore the global similarities between sequences or some specific valuable features, e.g., frequent patterns over all datasets. To improve on this, we propose an alignment-free algorithm based on sequential pattern mining, where each sequence is converted into a binary representation of sequential patterns among sequences. The phylogenetic tree is then constructed by clustering a distance matrix calculated from the pattern vectors. To increase accuracy for highly divergent sequences, we consider pattern weights and filter redundant sub-patterns. Both simulated and real data demonstrate that our method outperforms other alignment-free methods, especially for large sequence sets with low similarity.
6
Using machine learning tools for protein database biocuration assistance. Sci Rep 2018; 8:10148. [PMID: 29977071 PMCID: PMC6033909 DOI: 10.1038/s41598-018-28330-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 10/27/2017] [Accepted: 06/21/2018] [Indexed: 12/30/2022]
Abstract
Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes and relevant in cell communication as well as major drug targets in pharmacology. These receptors are characterized according to subtype labels. Previous analysis of this database provided evidence that some of the receptor sequences could be affected by a case of label noise, as they appeared to be too consistently misclassified by machine learning methods. Here, we extend our analysis to recent and quite substantially modified new versions of the database and reveal their now extremely accurate labeling using several machine learning models and different transformations of the unaligned sequences. These findings support the adequacy of our proposed method to identify problematic labeling cases as a tool for database biocuration.
7
Systematic discovery of complex insertions and deletions in human cancers. Nat Med 2015; 22:97-104. [PMID: 26657142 PMCID: PMC5003782 DOI: 10.1038/nm.4002] [Citation(s) in RCA: 70] [Impact Index Per Article: 7.0] [Received: 06/18/2015] [Accepted: 11/03/2015] [Indexed: 12/25/2022]
Abstract
Complex insertions and deletions (indels) are formed by simultaneously deleting and inserting DNA fragments of different sizes at a common genomic location. Here we present a systematic analysis of somatic complex indels in the coding sequences of samples from over 8,000 cancer cases using Pindel-C. We discovered 285 complex indels in cancer-associated genes (such as PIK3R1, TP53, ARID1A, GATA3 and KMT2D) in approximately 3.5% of cases analyzed; nearly all instances of complex indels were overlooked (81.1%) or misannotated (17.6%) in previous reports of 2,199 samples. In-frame complex indels are enriched in PIK3R1 and EGFR, whereas frameshifts are prevalent in VHL, GATA3, TP53, ARID1A, PTEN and ATRX. Furthermore, complex indels display strong tissue specificity (such as VHL in kidney cancer samples and GATA3 in breast cancer samples). Finally, structural analyses support findings of previously missed, but potentially druggable, mutations in the EGFR, MET and KIT oncogenes. This study indicates the critical importance of improving complex indel discovery and interpretation in medical research.
8
Kroon M, Lameijer EW, Lakenberg N, Hehir-Kwa JY, Thung DT, Slagboom PE, Kok JN, Ye K. Detecting dispersed duplications in high-throughput sequencing data using a database-free approach. Bioinformatics 2015; 32:505-10. [PMID: 26508759 DOI: 10.1093/bioinformatics/btv621] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 07/13/2015] [Accepted: 10/20/2015] [Indexed: 11/15/2022]
Abstract
MOTIVATION: Dispersed duplications (DDs) such as transposon element insertions and copy number variations are ubiquitous in the human genome. They have attracted the interest of biologists as well as medical researchers due to their role in both evolution and disease. The efforts of discovering DDs in high-throughput sequencing data are currently dominated by database-oriented approaches that require pre-existing knowledge of the DD elements to be detected. RESULTS: We present DD_DETECTION, a database-free approach to finding DD events in high-throughput sequencing data. DD_DETECTION is able to detect DDs purely from paired-end read alignments. We show in a comparative study that this method is able to compete with database-oriented approaches in recovering validated transposon insertion events. We also experimentally validate the predictions of DD_DETECTION on a human DNA sample, showing that it can find not only duplicated elements present in common databases but also DDs of novel type. AVAILABILITY AND IMPLEMENTATION: The software presented in this article is open source and available from https://bitbucket.org/mkroon/dd_detection.
Affiliation(s)
- M Kroon
- Department of Molecular Epidemiology, Leiden University Medical Center, Leiden
- E W Lameijer
- Department of Molecular Epidemiology, Leiden University Medical Center, Leiden
- N Lakenberg
- Department of Molecular Epidemiology, Leiden University Medical Center, Leiden
- J Y Hehir-Kwa
- Department of Human Genetics, Nijmegen Center for Molecular Life Sciences, Institute for Genetic and Metabolic Disease, Radboud University Nijmegen Medical Center, Nijmegen, Donders Centre for Neuroscience, Nijmegen, The Netherlands
- D T Thung
- Department of Human Genetics, Nijmegen Center for Molecular Life Sciences, Institute for Genetic and Metabolic Disease, Radboud University Nijmegen Medical Center, Nijmegen
- P E Slagboom
- Department of Molecular Epidemiology, Leiden University Medical Center, Leiden
- J N Kok
- Department of Molecular Epidemiology, Leiden University Medical Center, Leiden
- K Ye
- Department of Molecular Epidemiology, Leiden University Medical Center, Leiden, The Genome Institute, Washington University, St Louis, MO 63108, USA
9
König C, Cárdenas MI, Giraldo J, Alquézar R, Vellido A. Label noise in subtype discrimination of class C G protein-coupled receptors: A systematic approach to the analysis of classification errors. BMC Bioinformatics 2015; 16:314. [PMID: 26415951 PMCID: PMC4587730 DOI: 10.1186/s12859-015-0731-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 04/15/2015] [Accepted: 08/31/2015] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The characterization of proteins in families and subfamilies, at different levels, entails the definition and use of class labels. When the adscription of a protein to a family is uncertain, or even wrong, this becomes an instance of what has come to be known as a label noise problem. Label noise has a potentially negative effect on any quantitative analysis of proteins that depends on label information. This study investigates class C of G protein-coupled receptors, which are cell membrane proteins of relevance both to biology in general and pharmacology in particular. Their supervised classification into different known subtypes, based on primary sequence data, is hampered by label noise. The latter may stem from a combination of expert knowledge limitations and the lack of a clear correspondence between labels that mostly reflect GPCR functionality and the different representations of the protein primary sequences. RESULTS: In this study, we describe a systematic approach, using Support Vector Machine classifiers, to the analysis of G protein-coupled receptor misclassifications. As a proof of concept, this approach is used to assist the discovery of labeling quality problems in a curated, publicly accessible database of this type of proteins. We also investigate the extent to which physico-chemical transformations of the protein sequences reflect G protein-coupled receptor subtype labeling. The candidate mislabeled cases detected with this approach are externally validated with phylogenetic trees and against further trusted sources such as the National Center for Biotechnology Information, Universal Protein Resource, European Bioinformatics Institute and Ensembl Genome Browser information repositories. CONCLUSIONS: In quantitative classification problems, class labels are often by default assumed to be correct. Label noise, though, is bound to be a pervasive problem in bioinformatics, where labels may be obtained indirectly through complex, many-step similarity modelling processes. In the case of G protein-coupled receptors, methods capable of singling out and characterizing those sequences with consistent misclassification behaviour are required to minimize this problem. A systematic, Support Vector Machine-based method has been proposed in this study for such purpose. The proposed method enables a filtering approach to the label noise problem and might become a support tool for database curators in proteomics.
Affiliation(s)
- Caroline König
- Dept. of Computer Science, Univ. Politècnica de Catalunya, C. Jordi Girona, 1-3, Barcelona, 08034, Spain
- Martha I Cárdenas
- Dept. of Computer Science, Univ. Politècnica de Catalunya, C. Jordi Girona, 1-3, Barcelona, 08034, Spain; Institut de Neurociències, Unitat de Bioestadística, Univ. Autònoma de Barcelona, Cerdanyola del Vallès, Barcelona, 08193, Spain
- Jesús Giraldo
- Institut de Neurociències, Unitat de Bioestadística, Univ. Autònoma de Barcelona, Cerdanyola del Vallès, Barcelona, 08193, Spain
- René Alquézar
- Dept. of Computer Science, Univ. Politècnica de Catalunya, C. Jordi Girona, 1-3, Barcelona, 08034, Spain; Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, 08034, Spain
- Alfredo Vellido
- Dept. of Computer Science, Univ. Politècnica de Catalunya, C. Jordi Girona, 1-3, Barcelona, 08034, Spain; Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), Cerdanyola del Vallès, Barcelona, 08193, Spain
10
Abstract
High-throughput DNA sequencing has revolutionized the study of cancer genomics with numerous discoveries that are relevant to cancer diagnosis and treatment. The latest sequencing and analysis methods have successfully identified somatic alterations, including single-nucleotide variants, insertions and deletions, copy-number aberrations, structural variants and gene fusions. Additional computational techniques have proved useful for defining the mutations, genes and molecular networks that drive diverse cancer phenotypes and that determine clonal architectures in tumour samples. Collectively, these tools have advanced the study of genomic, transcriptomic and epigenomic alterations in cancer, and their association to clinical properties. Here, we review cancer genomics software and the insights that have been gained from their application.
11
Zhang Y, Lameijer EW, 't Hoen PAC, Ning Z, Slagboom PE, Ye K. PASSion: a pattern growth algorithm-based pipeline for splice junction detection in paired-end RNA-Seq data. Bioinformatics 2012; 28:479-86. [PMID: 22219203 PMCID: PMC3278765 DOI: 10.1093/bioinformatics/btr712] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Indexed: 11/13/2022]
Abstract
MOTIVATION: RNA-seq is a powerful technology for the study of transcriptome profiles that uses deep-sequencing technologies. Moreover, it may be used for cellular phenotyping and help establish the etiology of diseases characterized by abnormal splicing patterns. In RNA-Seq, the exact nature of splicing events is buried in the reads that span exon-exon boundaries. The accurate and efficient mapping of these reads to the reference genome is a major challenge. RESULTS: We developed PASSion, a pattern growth algorithm-based pipeline for splice site detection in paired-end RNA-Seq reads. Comparing the performance of PASSion to three existing RNA-Seq analysis pipelines, TopHat, MapSplice and HMMSplicer, revealed that PASSion is competitive with these packages. Moreover, the performance of PASSion is not affected by read length and coverage. It performs better than the other three approaches when detecting junctions in highly abundant transcripts. PASSion has the ability to detect junctions that do not have known splicing motifs, which cannot be found by the other tools. Of the two public RNA-Seq datasets, PASSion predicted ≈ 137,000 and 173,000 splicing events, of which on average 82% are known junctions annotated in the Ensembl transcript database and 18% are novel. In addition, our package can discover differential and shared splicing patterns among multiple samples. AVAILABILITY: The code and utilities can be freely downloaded from https://trac.nbic.nl/passion and ftp://ftp.sanger.ac.uk/pub/zn1/passion.
Affiliation(s)
- Yanju Zhang
- Department of Molecular Epidemiology, Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands.
12
Abstract
The wealth of available protein structural data provides unprecedented opportunity to study and better understand the underlying principles of protein folding and protein structure evolution. A key to achieving this lies in the ability to analyse these data and to organize them in a coherent classification scheme. Over the past years several protein classifications have been developed that aim to group proteins based on their structural relationships. Some of these classification schemes explore the concept of structural neighbourhood (structural continuum), whereas other utilize the notion of protein evolution and thus provide a discrete rather than continuum view of protein structure space. This chapter presents a strategy for classification of proteins with known three-dimensional structure. Steps in the classification process along with basic definitions are introduced. Examples illustrating some fundamental concepts of protein folding and evolution with a special focus on the exceptions to them are presented.
13
Ding L, Wendl MC, Koboldt DC, Mardis ER. Analysis of next-generation genomic data in cancer: accomplishments and challenges. Hum Mol Genet 2010; 19:R188-96. [PMID: 20843826 DOI: 10.1093/hmg/ddq391] [Citation(s) in RCA: 111] [Impact Index Per Article: 7.4] [Indexed: 01/05/2023]
Abstract
The application of next-generation sequencing technology has produced a transformation in cancer genomics, generating large data sets that can be analyzed in different ways to answer a multitude of questions about the genomic alterations associated with the disease. Analytical approaches can discover focused mutations such as substitutions and small insertion/deletions, large structural alterations and copy number events. As our capacity to produce such data for multiple cancers of the same type is improving, so are the demands to analyze multiple tumor genomes simultaneously growing. For example, pathway-based analyses that provide the full mutational impact on cellular protein networks and correlation analyses aimed at revealing causal relationships between genomic alterations and clinical presentations are both enabled. As the repertoire of data grows to include mRNA-seq, non-coding RNA-seq and methylation for multiple genomes, our challenge will be to intelligently integrate data types and genomes to produce a coherent picture of the genetic basis of cancer.
Affiliation(s)
- Li Ding
- Department of Genetics, The Genome Center at Washington University School of Medicine, 4444 Forest Park Blvd., St Louis, MO 63108, USA
14
Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009; 25:2865-71. [PMID: 19561018 PMCID: PMC2781750 DOI: 10.1093/bioinformatics/btp394] [Citation(s) in RCA: 1535] [Impact Index Per Article: 95.9] [Indexed: 11/18/2022]
Abstract
Motivation: There is a strong demand in the genomic community to develop effective algorithms to reliably identify genomic variants. Indel detection using next-gen data is difficult and identification of long structural variations is extremely challenging. Results: We present Pindel, a pattern growth approach, to detect breakpoints of large deletions and medium-sized insertions from paired-end short reads. We use both simulated reads and real data to demonstrate the efficiency of the computer program and accuracy of the results. Availability: The binary code and a short user manual can be freely downloaded from http://www.ebi.ac.uk/~kye/pindel/. Contact: k.ye@lumc.nl; zn1@sanger.ac.uk
Affiliation(s)
- Kai Ye
- EMBL Outstation European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
15
Structural Variations in Protein Superfamilies: Actin and Tubulin. Mol Biotechnol 2009; 42:49-60. [DOI: 10.1007/s12033-008-9128-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 08/12/2008] [Accepted: 11/14/2008] [Indexed: 11/28/2022]
16
Zhang K, Fan W, Deininger P, Edwards A, Xu Z, Zhu D. Breaking the computational barrier: a divide-conquer and aggregate based approach for Alu insertion site characterisation. Int J Comput Biol Drug Des 2009; 2:302-22. [PMID: 20090173 DOI: 10.1504/ijcbdd.2009.030763] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 01/20/2023]
Abstract
Insertion site characterisation of Alu elements is an important problem in primate-specific bioinformatics research. Key characteristics of this challenging problem include: data are not in the pre-defined feature vectors for predictive model construction; without any prior knowledge, can we discover the general patterns that could exist and also make biological insights?; how to obtain the compact yet discriminative patterns given a search space of 4^200? This paper provides an integrated algorithmic framework for fulfilling the above mining tasks. Compared to the benchmark biological study, our results provide a further refined analysis of the patterns involved in Alu insertion. In particular, we acquire a 200nt predictive profile around the primary insertion site which not only contains the widely accepted consensus, but also suggests a longer pattern (T(7)AA[G'A]AATAA). This pattern provides more insight into the favourable sequence variations allowed for preferred binding and cleavage by the L1 ORF2 endonuclease. The proposed method is general enough that it can also be applied to other sequence detection problems, such as microRNA target prediction.
Affiliation(s)
- Kun Zhang
- Department of Computer Science, Xavier University of Louisiana, New Orleans, Louisiana 70125, USA.