1
|
Madrigal G, Minhas BF, Catchen J. Klumpy: A tool to evaluate the integrity of long-read genome assemblies and illusive sequence motifs. Mol Ecol Resour 2025; 25:e13982. [PMID: 38800997 PMCID: PMC11646305 DOI: 10.1111/1755-0998.13982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Accepted: 05/13/2024] [Indexed: 05/29/2024]
Abstract
The improvement and decreasing costs of third-generation sequencing technologies have widened the scope of biological questions researchers can address with de novo genome assemblies. With the increasing number of reference genomes, validating their integrity with minimal overhead is vital for establishing confident results in their applications. Here, we present Klumpy, a tool for detecting and visualizing both misassembled regions in a genome assembly and genetic elements (e.g. genes) of interest in a set of sequences. By leveraging the initial raw reads in combination with their respective genome assembly, we illustrate Klumpy's utility by investigating antifreeze glycoprotein (afgp) loci across two icefishes, by searching for a gene reported as absent in the northern snakehead fish, and by scanning the reference genomes of a mudskipper and bumblebee for misassembled regions. In the two former cases, we were able to provide support for the noncanonical placement of an afgp locus in the icefishes and locate the missing snakehead gene. Furthermore, our genome scans were able to identify an unmappable locus in the mudskipper reference genome and a putative repetitive element shared among several species of bees.
Affiliation(s)
- Giovanni Madrigal
- Department of Evolution, Ecology, and Behavior, University of Illinois at Urbana‐Champaign, Urbana, Illinois, USA
| | - Bushra Fazal Minhas
- Informatics Program, University of Illinois at Urbana‐Champaign, Urbana, Illinois, USA
| | - Julian Catchen
- Department of Evolution, Ecology, and Behavior, University of Illinois at Urbana‐Champaign, Urbana, Illinois, USA
- Informatics Program, University of Illinois at Urbana‐Champaign, Urbana, Illinois, USA
|
2
|
Zhang X, Pan W. Exon prediction based on multiscale products of a genomic-inspired multiscale bilateral filtering. PLoS One 2019; 14:e0205050. [PMID: 30897105 PMCID: PMC6428306 DOI: 10.1371/journal.pone.0205050] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/13/2018] [Accepted: 03/05/2019] [Indexed: 11/21/2022]
Abstract
Multiscale signal processing techniques such as wavelet filtering have proved particularly successful in predicting exon sequences. The traditional wavelet predictor is a domain filter that enforces exon features by weighting nucleotide values with coefficients. Such a measure performs linear filtering and is not suitable for preserving short coding exons and exon-intron boundaries. This paper describes a prediction framework that is capable of non-linearly processing DNA sequences while achieving high prediction rates. There are two key contributions. The first is the introduction of a genomic-inspired multiscale bilateral filtering (MSBF), which exploits both weighting coefficients in the spatial domain and nucleotide similarity in the range. Like the wavelet transform, the MSBF is defined as a weighted sum of nucleotides. The difference is that the MSBF takes into account the variation of nucleotides at a specific codon position. The second contribution is the exploitation of inter-scale correlation in the MSBF domain to find the inter-scale dependency on the differences between the exon signal and the background noise. This favourable property is used to sharpen the important structures while weakening noise. Three benchmark data sets have been used to evaluate the considered methods. In comparison with four existing techniques, the proposed method shows improvements of at least 4.1%, 50.5%, 25.6%, 2.5%, 10.8%, 15.5%, 11.1%, 12.3%, 9.2% and 2.4% for exons of length 1–24, 25–49, 50–74, 75–99, 100–124, 125–149, 150–174, 175–199, 200–299 and 300+, respectively. Owing to its nonlinear nature, the MSBF is good at energy compaction, which makes it capable of locating the sharp variations around short exons. The direct scale multiplication of coefficients at several adjacent scales enhanced exon features while suppressing the noise content.
We show that the non-linear nature and correlation-based property of the proposed predictor lead to better exon prediction performance than traditional filtering. The predictor may also have applications beyond genomics: its good localization and preservation of sharp variations could make it suitable for fault diagnosis of aero-engines.
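To make the idea of combining spatial weights with similarity in the range concrete, here is a minimal sketch of a generic one-dimensional bilateral filter. It is not the MSBF described in the abstract: the Gaussian kernels, the parameter names (`sigma_s`, `sigma_r`, `radius`) and the assumption that the DNA sequence has already been mapped to a numeric signal are all illustrative choices.

```python
import math

def bilateral_filter_1d(signal, sigma_s=2.0, sigma_r=0.5, radius=4):
    """Generic 1D bilateral filter (illustrative, not the published MSBF).

    Each output sample is a weighted mean of its neighbours. The weight is
    the product of a spatial Gaussian (distance in position) and a range
    Gaussian (difference in value), so sharp steps are smoothed far less
    than they would be by a purely linear filter.
    """
    out = []
    for i, center in enumerate(signal):
        num = 0.0
        den = 0.0
        lo = max(0, i - radius)
        hi = min(len(signal), i + radius + 1)
        for j in range(lo, hi):
            w_space = math.exp(-((i - j) ** 2) / (2 * sigma_s ** 2))
            w_range = math.exp(-((signal[j] - center) ** 2) / (2 * sigma_r ** 2))
            num += w_space * w_range * signal[j]
            den += w_space * w_range
        out.append(num / den)  # den >= 1 because the j == i term has weight 1
    return out
```

With a small `sigma_r`, samples on the far side of a step receive almost no weight, which is the edge-preserving behaviour that a linear weighted sum lacks.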
Affiliation(s)
- Xiaolei Zhang
- College of Air Traffic Management, Civil Aviation Flight University of China, Guanghan, P.R. China
| | - Weijun Pan
- College of Air Traffic Management, Civil Aviation Flight University of China, Guanghan, P.R. China
- * E-mail:
|
3
|
Marhon SA, Kremer SC. Prediction of Protein Coding Regions Using a Wide-Range Wavelet Window Method. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:742-753. [PMID: 26415183 DOI: 10.1109/tcbb.2015.2476789] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 06/05/2023]
Abstract
Prediction of protein coding regions is an important topic in the field of genomic sequence analysis. Several spectrum-based techniques for the prediction of protein coding regions have been proposed. However, the outstanding issue in most of the proposed techniques is that these techniques depend on an experimentally-selected, predefined value of the window length. In this paper, we propose a new Wide-Range Wavelet Window (WRWW) method for the prediction of protein coding regions. The analysis of the proposed wavelet window shows that its frequency response can adapt its width to accommodate the change in the window length so that it can allow or prevent frequencies other than the basic frequency in the analysis of DNA sequences. This feature makes the proposed window capable of analyzing DNA sequences with a wide range of the window lengths without degradation in the performance. The experimental analysis of applying the WRWW method and other spectrum-based methods to five benchmark datasets has shown that the proposed method outperforms other methods along a wide range of the window lengths. In addition, the experimental analysis has shown that the proposed method is dominant in the prediction of both short and long exons.
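For context on the window-length issue, the following is a minimal sketch of a fixed-window spectral baseline, not the WRWW method itself. It assumes the "basic frequency" is the period-3 component that coding regions tend to show, uses binary indicator sequences for the four bases, and applies a plain rectangular window; the function name and the default `window` value are illustrative.

```python
import cmath

def period3_power(seq, window=9):
    """Sliding-window power at frequency 1/3, summed over the four bases.

    For each window, every base gets a 0/1 indicator sequence, and the
    squared magnitude of its Fourier coefficient at frequency 1/3 is added
    up. The result depends directly on the chosen window length.
    """
    seq = seq.upper()
    powers = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        total = 0.0
        for base in "ACGT":
            coeff = sum(
                cmath.exp(-2j * cmath.pi * n / 3)
                for n, b in enumerate(chunk)
                if b == base
            )
            total += abs(coeff) ** 2
        powers.append(total)
    return powers
```

A perfectly codon-periodic toy sequence such as repeated `ATG` gives a large value in every window, while a homopolymer gives a value near zero when the window length is a multiple of three.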
|
4
|
Zhang X, Shen Z, Zhang G, Shen Y, Chen M, Zhao J, Wu R. Short Exon Detection via Wavelet Transform Modulus Maxima. PLoS One 2016; 11:e0163088. [PMID: 27635656 PMCID: PMC5026382 DOI: 10.1371/journal.pone.0163088] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 06/03/2016] [Accepted: 09/04/2016] [Indexed: 02/05/2023]
Abstract
The detection of short exons is a challenging open problem in bioinformatics. Because existing model-independent methods cannot reliably detect small exons, a model-independent method based on singularity detection with wavelet transform modulus maxima has been developed for detecting short coding sequences (exons) in eukaryotic DNA sequences. In our method, the local maxima capture and characterize the singularities of short exons, which helps to yield significant patterns that are rarely observed with traditional methods. To characterize how the singularities of the exon signal differ from the background noise, the noise level is estimated by filtering the genomic sequence through a notch filter. Meanwhile, a fast method based on a piecewise cubic Hermite interpolating polynomial is applied to reconstruct the wavelet coefficients, improving computational efficiency. In addition, the output measure of a paired-numerical representation calculated in both forward and reverse directions is used to incorporate a useful DNA structural property. The performance of our approach and other techniques is evaluated on two benchmark data sets. Experimental results demonstrate that the proposed method outperforms all assessed model-independent methods for detecting short exons in terms of the evaluation metrics.
Affiliation(s)
- Xiaolei Zhang
- Shantou University Medical College, Shantou, P.R. China
| | - Zhiwei Shen
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
| | - Guishan Zhang
- College of Engineering, Shantou University, Shantou, P.R. China
| | - Yuanyu Shen
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
| | - Miaomiao Chen
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
| | - Jiaxiang Zhao
- College of Electronic Information and Optical Engineering, Nankai University, Tianjin, P.R. China
- * E-mail: (JXZ); (RHW)
| | - Renhua Wu
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
- * E-mail: (JXZ); (RHW)
|
5
|
Albuquerque JP, Tobias-Santos V, Rodrigues AC, Mury FB, da Fonseca RN. small ORFs: A new class of essential genes for development. Genet Mol Biol 2015; 38:278-83. [PMID: 26500431 PMCID: PMC4612599 DOI: 10.1590/s1415-475738320150009] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 01/14/2015] [Accepted: 03/30/2015] [Indexed: 12/02/2022]
Abstract
Genes that contain small open reading frames (smORFs) constitute a new group of eukaryotic genes and are expected to represent 5% of the Drosophila melanogaster transcribed genes. In this review we provide a historical perspective of their recent discovery, describe their general mechanism and discuss the importance of smORFs for future genomic and transcriptomic studies. Finally, we discuss the biological role of the most studied smORF so far, the Mlpt/Pri/Tal gene in arthropods. The pleiotropic action of Mlpt/Pri/Tal in D. melanogaster suggests a complex evolutionary scenario that can be used to understand the origins, evolution and integration of smORFs into complex gene regulatory networks.
Affiliation(s)
- João Paulo Albuquerque
- Laboratório Integrado de Bioquímica Hatisaburo Masuda, Núcleo em Ecologia e Desenvolvimento Sócio-Ambiental, Universidade Federal de Rio de Janeiro, Macaé, RJ, Brazil
| | - Vitória Tobias-Santos
- Laboratório Integrado de Bioquímica Hatisaburo Masuda, Núcleo em Ecologia e Desenvolvimento Sócio-Ambiental, Universidade Federal de Rio de Janeiro, Macaé, RJ, Brazil
| | - Aline Cáceres Rodrigues
- Laboratório Integrado de Bioquímica Hatisaburo Masuda, Núcleo em Ecologia e Desenvolvimento Sócio-Ambiental, Universidade Federal de Rio de Janeiro, Macaé, RJ, Brazil
| | - Flávia Borges Mury
- Laboratório Integrado de Bioquímica Hatisaburo Masuda, Núcleo em Ecologia e Desenvolvimento Sócio-Ambiental, Universidade Federal de Rio de Janeiro, Macaé, RJ, Brazil. ; Entomologia Molecular, Instituto Nacional de Ciência e Tecnologia, Universidade Federal de Rio de Janeiro, Rio de Janeiro, RJ, Brazil
| | - Rodrigo Nunes da Fonseca
- Laboratório Integrado de Bioquímica Hatisaburo Masuda, Núcleo em Ecologia e Desenvolvimento Sócio-Ambiental, Universidade Federal de Rio de Janeiro, Macaé, RJ, Brazil. ; Entomologia Molecular, Instituto Nacional de Ciência e Tecnologia, Universidade Federal de Rio de Janeiro, Rio de Janeiro, RJ, Brazil
|
6
|
Guo FB, Lin Y, Chen LL. Recognition of Protein-coding Genes Based on Z-curve Algorithms. Curr Genomics 2014; 15:95-103. [PMID: 24822027 PMCID: PMC4009845 DOI: 10.2174/1389202915999140328162724] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/15/2013] [Revised: 11/19/2013] [Accepted: 11/20/2013] [Indexed: 01/18/2023]
Abstract
Recognition of protein-coding genes, a classical bioinformatics problem, is an essential step in annotating newly sequenced genomes. The Z-curve algorithm, one of the most effective methods for this task, has been successfully applied in annotating or re-annotating many genomes, including those of bacteria, archaea and viruses. Two Z-curve based ab initio gene-finding programs have been developed: ZCURVE (for bacteria and archaea) and ZCURVE_V (for viruses and phages). ZCURVE_C (for 57 bacteria) and Zfisher (for any bacterium) are web servers for re-annotation of bacterial and archaeal genomes. These four tools can be used for genome annotation or re-annotation, either independently or combined with other gene-finding programs. In addition to recognizing protein-coding genes and exons, Z-curve algorithms are also effective in recognizing promoters and translation start sites. Here, we summarize the applications of Z-curve algorithms in gene finding and genome annotation.
Affiliation(s)
- Feng-Biao Guo
- Center of Bioinformatics and Key Laboratory for NeuroInformation of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, 610054, China
| | - Yan Lin
- Department of Physics, Tianjin University, Tianjin 300072, China
| | - Ling-Ling Chen
- College of Life Science and Technology, Huazhong Agricultural University, Wuhan, 430070, China
|
7
|
Zhang R, Zhang CT. A Brief Review: The Z-curve Theory and its Application in Genome Analysis. Curr Genomics 2014; 15:78-94. [PMID: 24822026 PMCID: PMC4009844 DOI: 10.2174/1389202915999140328162433] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 09/08/2013] [Revised: 10/16/2013] [Accepted: 10/16/2013] [Indexed: 11/22/2022]
Abstract
In theoretical physics, there exist two basic mathematical approaches, algebraic and geometrical methods, which, in most cases, are complementary. In the area of genome sequence analysis, however, algebraic approaches have been widely used, while geometrical approaches have long been less explored. The Z-curve theory is a geometrical approach to genome analysis. The Z-curve is a three-dimensional curve that represents a given DNA sequence in the sense that each can be uniquely reconstructed given the other. The Z-curve, therefore, contains all the information that the corresponding DNA sequence carries. The analysis of a DNA sequence can then be performed through studying the corresponding Z-curve. The Z-curve method has found applications in a wide range of areas in the past two decades, including the identification of protein-coding genes, replication origins, horizontally-transferred genomic islands, promoters, translation start sites and isochores, as well as studies on phylogenetics, genome visualization and comparative genomics. Here, we review the progress of Z-curve studies from aspects of both theory and applications in genome analysis.
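The one-to-one correspondence between a sequence and its Z-curve can be illustrated with a short sketch. It assumes the commonly cited coordinate definition (x separates purines from pyrimidines, y separates amino from keto bases, z separates weak from strong hydrogen-bonding bases); the function names are illustrative and ambiguous bases such as N are not handled.

```python
# Unit step that each base contributes to the cumulative (x, y, z) coordinates:
#   x = (A + G) - (C + T), y = (A + C) - (G + T), z = (A + T) - (C + G)
STEPS = {"A": (1, 1, 1), "C": (-1, 1, -1), "G": (1, -1, -1), "T": (-1, -1, 1)}
BASES = {step: base for base, step in STEPS.items()}

def z_curve(seq):
    """Return the list of cumulative (x, y, z) points, one per base."""
    x = y = z = 0
    points = []
    for base in seq.upper():
        dx, dy, dz = STEPS[base]  # raises KeyError for non-ACGT symbols
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points

def reconstruct(points):
    """Recover the DNA sequence from its Z-curve points."""
    seq = []
    prev = (0, 0, 0)
    for point in points:
        step = tuple(c - p for c, p in zip(point, prev))
        seq.append(BASES[step])
        prev = point
    return "".join(seq)
```

Because each base maps to a distinct step, differencing consecutive points recovers the sequence exactly, which is the sense in which the curve carries all the information in the sequence.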
Affiliation(s)
- Ren Zhang
- Center for Molecular Medicine and Genetics, Wayne State University Medical School, Detroit, MI 48201, USA
| | - Chun-Ting Zhang
- Department of Physics, Tianjin University, Tianjin 300072, China
|
8
|
Chen S, Zhang CY, Song K. Recognizing short coding sequences of prokaryotic genome using a novel iteratively adaptive sparse partial least squares algorithm. Biol Direct 2013; 8:23. [PMID: 24067167 PMCID: PMC3852556 DOI: 10.1186/1745-6150-8-23] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 04/01/2013] [Accepted: 09/23/2013] [Indexed: 11/10/2022]
Abstract
BACKGROUND Significant efforts have been made to address the problem of identifying short genes in prokaryotic genomes. However, most known methods are not effective in detecting short genes. Because of the limited information contained in short DNA sequences, it is very difficult to accurately distinguish between protein coding and non-coding sequences in prokaryotic genomes. We have developed a new Iteratively Adaptive Sparse Partial Least Squares (IASPLS) algorithm as the classifier to improve the accuracy of the identification process. RESULTS For testing, we chose the short coding and non-coding sequences from seven prokaryotic organisms. We used seven feature sets (including GC content, Z-curve, etc.) of short genes. In comparison with the GeneMarkS, Metagene, Orphelia, and Heuristic Approach methods, our model achieved the best prediction performance in identification of short prokaryotic genes. Even when we focused on the very short length group (the [60, 100) nt range), our model provided sensitivity as high as 83.44% and specificity as high as 92.8%. These values are two or three times higher than those of three of the other methods, while Metagene fails to recognize genes in this length range. The experiments also showed that the IASPLS can improve the identification accuracy in comparison with other widely used classifiers, i.e. Logistic, Random Forest (RF) and K nearest neighbors (KNN). The accuracy of IASPLS was higher than that of the other methods by 5.90% or more. In addition to the improvements in accuracy, IASPLS required ten times less computer time than KNN or RF. CONCLUSIONS We conclude that our method is preferable as an automated method of short gene classification. Its linearity and easily optimized parameters make it practicable for predicting short genes of newly-sequenced or under-studied species.
Affiliation(s)
- Sun Chen
- School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China.
|
9
|
Small open reading frames associated with morphogenesis are hidden in plant genomes. Proc Natl Acad Sci U S A 2013; 110:2395-400. [PMID: 23341627 DOI: 10.1073/pnas.1213958110] [Citation(s) in RCA: 136] [Impact Index Per Article: 11.3] [Indexed: 01/08/2023]
Abstract
It is likely that many small ORFs (sORFs; 30-100 amino acids) are missed when genomes are annotated. To overcome this limitation, we identified ∼8,000 sORFs with high coding potential in intergenic regions of the Arabidopsis thaliana genome. However, the question remains as to whether these coding sORFs play functional roles. Using a designed array, we generated an expression atlas for 16 organs and 17 environmental conditions among 7,901 identified coding sORFs. A total of 2,099 coding sORFs were highly expressed under at least one experimental condition, and 571 were significantly conserved in other land plants. A total of 473 coding sORFs were overexpressed; ∼10% (49/473) induced visible phenotypic effects, a proportion that is approximately seven times higher than that of randomly chosen known genes. These results indicate that many coding sORFs hidden in plant genomes are associated with morphogenesis. We believe that the expression atlas will contribute to further study of the roles of sORFs in plants.
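As a rough illustration of what candidate sORFs in the 30-100 amino acid range look like at the sequence level, here is a minimal scanning sketch. It is an assumption-laden simplification, not the study's pipeline: it scans only the three forward-strand frames, takes the first ATG after each stop as the start, and does not score coding potential. The function name and tuple layout are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_sorfs(seq, min_aa=30, max_aa=100):
    """Return (start, end, n_aa) for forward-strand ORFs within a size range.

    start is the 0-based index of the ATG, end is the index just past the
    stop codon, and n_aa counts codons from the ATG up to (not including)
    the stop codon.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3
                if min_aa <= n_aa <= max_aa:
                    hits.append((start, i + 3, n_aa))
                start = None  # look for the next ORF in this frame
    return hits
```

A real annotation workflow would also scan the reverse complement and filter candidates by coding potential, expression or conservation, as the abstract describes.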
|
10
|
Abstract
Gene structure data can substantially advance our understanding of metazoan evolution and deliver an independent approach to resolve conflicts among existing hypotheses. Here, we used changes of spliceosomal intron positions as a novel phylogenetic marker to reconstruct the animal tree. This kind of data is inferred from orthologous genes containing mutually exclusive introns at pairs of sequence positions in close proximity, so-called near intron pairs (NIPs). NIP data were collected for 48 species and utilized as binary genome-level characters in maximum parsimony (MP) analyses to reconstruct deep metazoan phylogeny. All groupings that were obtained with more than 80% bootstrap support are consistent with currently supported phylogenetic hypotheses. This includes monophyletic Chordata, Vertebrata, Nematoda, Platyhelminthes and Trochozoa. Several other clades such as Deuterostomia, Protostomia, Arthropoda, Ecdysozoa, Spiralia, and Eumetazoa, however, failed to be recovered due to a few problematic taxa such as the mite Ixodes and the warty comb jelly Mnemiopsis. The corresponding unexpected branchings can be explained by the paucity of synapomorphic changes of intron positions shared between some genomes, by the sensitivity of MP analyses to long-branch attraction (LBA), and by the very unequal evolutionary rates of intron loss and intron gain during evolution of the different subclades of metazoans. In addition, we obtained an assemblage of Cnidaria, Porifera, and Placozoa as sister group of Bilateria+Ctenophora with medium support, a disputable, but remarkable result. We conclude that NIPs can also be used as phylogenetic characters within a broader phylogenetic context, given that they have emerged regularly during evolution irrespective of the large variation of intron density across metazoan genomes.
Affiliation(s)
- Jörg Lehmann
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany
|
11
|
SNR of DNA sequences mapped by general affine transformations of the indicator sequences. J Math Biol 2012; 67:433-51. [DOI: 10.1007/s00285-012-0564-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 08/21/2011] [Revised: 07/02/2012] [Indexed: 10/28/2022]
|
12
|
Song K, Zhang Z, Tong TP, Wu F. Classifier assessment and feature selection for recognizing short coding sequences of human genes. J Comput Biol 2012; 19:251-60. [PMID: 22401589 DOI: 10.1089/cmb.2011.0078] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/13/2022]
Abstract
With the ever-increasing pace of genome sequencing, there is a great need for fast and accurate computational tools to automatically identify genes in these genomes. Although great progress has been made in the development of gene-finding algorithms during the past decades, there is still room for further improvement. In particular, the problem of recognizing short exons in eukaryotes is still not solved satisfactorily. This article assesses various linear and kernel-based classification algorithms and selects the best combination of Z-curve features to address this problem. Eight state-of-the-art linear and kernel-based supervised pattern recognition techniques were used to identify the short (21-192 bp) coding sequences of human genes. By measuring the prediction accuracy, the tradeoff between sensitivity and specificity, and the time consumption, partial least squares (PLS) and kernel partial least squares (KPLS) algorithms were found to be the best linear and kernel-based classifiers, respectively. A surprising result was that, by making good use of the interpretability of the PLS and the Z-curve methods, a set of 93 Z-curve features proved to be the best selected combination. Using them, the average recognition accuracy improved by as much as 7.7% by means of KPLS when compared with what was obtained by the Fisher discriminant analysis using 189 Z-curve variables (Gao and Zhang, 2004). The code used is freely available for the following approaches (implemented in MATLAB and supported on Linux and MS Windows): (1) SVM: http://www.support-vector-machines.org/SVM_soft.html. (2) GP: http://www.gaussianprocess.org. (3) KPLS and KFDA: Taylor, J.S., and Cristianini, N. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK. (4) PLS: Wise, B.M., and Gallagher, N.B. 2011. PLS-Toolbox for use with MATLAB: ver 1.5.2. Eigenvector Technologies, Manson, WA.
Supplementary Material for this article is available at www.liebertonline.com/cmb.
Affiliation(s)
- Kai Song
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China.
|
13
|
Cheng H, Chan WS, Li Z, Wang D, Liu S, Zhou Y. Small open reading frames: current prediction techniques and future prospect. Curr Protein Pept Sci 2012; 12:503-7. [PMID: 21787300 DOI: 10.2174/138920311796957667] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Received: 04/01/2011] [Revised: 04/01/2011] [Accepted: 05/04/2011] [Indexed: 11/22/2022]
Abstract
Evidence is accumulating that small open reading frames (sORF, <100 codons) play key roles in many important biological processes. Yet, they are generally ignored in gene annotation even though they are far more abundant than genes with more than 100 codons. Here, we demonstrate that popular homology search and codon-index techniques perform poorly for small genes relative to larger genes, while a method dedicated to sORF discovery has a similar level of accuracy to homology search. The result is largely due to the small dataset of experimentally verified sORFs available for homology search and for training ab initio techniques. It highlights the urgent need for both experimental and computational studies in order to further advance the accuracy of sORF prediction.
Affiliation(s)
- Haoyu Cheng
- Indiana University School of Informatics, Indiana University-Purdue University and Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
|
14
|
Goli B, Nair AS. The elusive short gene – an ensemble method for recognition for prokaryotic genome. Biochem Biophys Res Commun 2012; 422:36-41. [DOI: 10.1016/j.bbrc.2012.04.090] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 04/13/2012] [Accepted: 04/17/2012] [Indexed: 10/28/2022]
|
15
|
Large-scale experimental studies show unexpected amino acid effects on protein expression and solubility in vivo in E. coli. Microbial Informatics and Experimentation 2011; 1:6. [PMID: 22587847 PMCID: PMC3372292 DOI: 10.1186/2042-5783-1-6] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Received: 03/17/2011] [Accepted: 06/27/2011] [Indexed: 11/10/2022]
Abstract
The biochemical and physical factors controlling protein expression level and solubility in vivo remain incompletely characterized. To gain insight into the primary sequence features influencing these outcomes, we performed statistical analyses of results from the high-throughput protein-production pipeline of the Northeast Structural Genomics Consortium. Proteins expressed in E. coli and consistently purified were scored independently for expression and solubility levels. These parameters nonetheless show a very strong positive correlation. We used logistic regressions to determine whether they are systematically influenced by fractional amino acid composition or several bulk sequence parameters including hydrophobicity, sidechain entropy, electrostatic charge, and predicted backbone disorder. Decreasing hydrophobicity correlates with higher expression and solubility levels, but this correlation apparently derives solely from the beneficial effect of three charged amino acids, at least for bacterial proteins. In fact, the three most hydrophobic residues showed very different correlations with solubility level. Leu showed the strongest negative correlation among amino acids, while Ile showed a slightly positive correlation in most data segments. Several other amino acids also had unexpected effects. Notably, Arg correlated with decreased expression and, most surprisingly, solubility of bacterial proteins, an effect only partially attributable to rare codons. However, rare codons did significantly reduce expression despite use of a codon-enhanced strain. Additional analyses suggest that positively but not negatively charged amino acids may reduce translation efficiency in E. coli irrespective of codon usage. While some observed effects may reflect indirect evolutionary correlations, others may reflect basic physicochemical phenomena. 
We used these results to construct and validate predictors of expression and solubility levels and overall protein usability, and we propose new strategies to be explored for engineering improved protein expression and solubility.
|
16
|
Marhon SA, Kremer SC. Gene Prediction Based on DNA Spectral Analysis: A Literature Review. J Comput Biol 2011; 18:639-76. [PMID: 21381961 DOI: 10.1089/cmb.2010.0184] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.7] [Indexed: 11/13/2022]
Affiliation(s)
- Sajid A. Marhon
- School of Computer Science, University of Guelph, Guelph, Ontario, Canada
| | - Stefan C. Kremer
- School of Computer Science, University of Guelph, Guelph, Ontario, Canada
|
17
|
Lehmann J, Eisenhardt C, Stadler PF, Krauss V. Some novel intron positions in conserved Drosophila genes are caused by intron sliding or tandem duplication. BMC Evol Biol 2010; 10:156. [PMID: 20500887 PMCID: PMC2891723 DOI: 10.1186/1471-2148-10-156] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 11/30/2009] [Accepted: 05/26/2010] [Indexed: 01/24/2023]
Abstract
BACKGROUND Positions of spliceosomal introns are often conserved between remotely related genes. Introns that reside in non-conserved positions are either novel or remnants of frequent losses of introns in some evolutionary lineages. A recent gain of such introns is difficult to prove. However, introns verified as novel are needed to evaluate contemporary processes of intron gain. RESULTS We identified 25 unambiguous cases of novel intron positions in 31 Drosophila genes that exhibit near intron pairs (NIPs). Here, a NIP consists of an ancient and a novel intron position that are separated by less than 32 nt. Within a single gene, such closely-spaced introns are very unlikely to have coexisted. In most cases, therefore, the ancient intron position must have disappeared in favour of the novel one. A survey for NIPs among 12 Drosophila genomes identifies intron sliding (migration) as one of the more frequent causes of novel intron positions. Other novel introns seem to have been gained by regional tandem duplications of coding sequences containing a proto-splice site. CONCLUSIONS Recent intron gains sometimes appear to have arisen by duplication of exonic sequences and subsequent intronization of one of the copies. Intron migration and exon duplication together may account for a significant amount of novel intron positions in conserved coding sequences.
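The distance criterion for a near intron pair can be sketched in a few lines. This is only an illustration of the "separated by less than 32 nt" rule: it assumes intron positions are already expressed as integer coordinates on a shared coding-sequence alignment and that the ancient and novel sets have been classified beforehand. The function name and inputs are hypothetical, and the orthology and mutual-exclusivity checks described in the abstract are not reproduced.

```python
def near_intron_pairs(ancient, novel, max_dist=32):
    """Pair ancient and novel intron positions lying less than max_dist nt apart.

    Both inputs are iterables of integer positions in the same coordinate
    system. Identical positions are skipped because they describe the same
    intron position rather than a pair.
    """
    pairs = []
    for a in sorted(ancient):
        for n in sorted(novel):
            if a != n and abs(a - n) < max_dist:
                pairs.append((a, n))
    return pairs
```

For example, an ancient position at 100 and a novel position at 120 form a pair, while positions exactly 32 nt apart do not.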
Collapse
Affiliation(s)
- Jörg Lehmann
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, 04107 Leipzig, Germany
18
Ma J, Chen X, Wang M, Kang Z. Constructing Physical and Genomic Maps for Puccinia striiformis f. sp. tritici, the Wheat Stripe Rust Pathogen, by Comparing Its EST Sequences to the Genomic Sequence of P. graminis f. sp. tritici, the Wheat Stem Rust Pathogen. Comp Funct Genomics 2010; 2009:302620. [PMID: 20169145 PMCID: PMC2821759 DOI: 10.1155/2009/302620] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 10/26/2009] [Accepted: 12/20/2009] [Indexed: 01/09/2023]
Abstract
The wheat stripe rust fungus, Puccinia striiformis f. sp. tritici (Pst), does not have a known alternate host for sexual reproduction, which makes it impossible to study gene linkages through classic genetic and molecular mapping approaches. In this study, we compared 4,219 Pst expression sequence tags (ESTs) to the genomic sequence of P. graminis f. sp. tritici (Pgt), the wheat stem rust fungus, using BLAST searches. The percentages of homologous genes varied greatly among different Pst libraries with 54.51%, 51.21%, and 13.61% for the urediniospore, germinated urediniospore, and haustorial libraries, respectively, with an average of 33.92%. The 1,432 Pst genes with significant homology with Pgt sequences were grouped into physical groups corresponding to 237 Pgt supercontigs. The physical relationship was demonstrated by 12 pairs (57%), out of 21 selected Pst gene pairs, through PCR screening of a Pst BAC library. The results indicate that the Pgt genome sequence is useful in constructing Pst physical maps.
Affiliation(s)
- Jinbiao Ma
- College of Life Sciences, Northwest A&F University, Yangling, Shaanxi 712100, China
- Department of Plant Pathology, Washington State University, Pullman, WA 99164-6430, USA
- Xianming Chen
- Department of Plant Pathology, Washington State University, Pullman, WA 99164-6430, USA
- USDA-ARS, Wheat Genetics Quality, Physiology, and Disease Research Unit, Pullman, WA 99164-6430, USA
- Meinan Wang
- Department of Plant Pathology, Washington State University, Pullman, WA 99164-6430, USA
- Zhensheng Kang
- College of Plant Protection, Northwest A&F University, Yangling, Shaanxi 712100, China
19
Lin MF, Deoras AN, Rasmussen MD, Kellis M. Performance and scalability of discriminative metrics for comparative gene identification in 12 Drosophila genomes. PLoS Comput Biol 2008; 4:e1000067. [PMID: 18421375 PMCID: PMC2291194 DOI: 10.1371/journal.pcbi.1000067] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.5] [Received: 10/22/2007] [Accepted: 03/20/2008] [Indexed: 01/22/2023]
Abstract
Comparative genomics of multiple related species is a powerful methodology for the discovery of functional genomic elements, and its power should increase with the number of species compared. Here, we use 12 Drosophila genomes to study the power of comparative genomics metrics to distinguish between protein-coding and non-coding regions. First, we study the relative power of different comparative metrics and their relationship to single-species metrics. We find that even relatively simple multi-species metrics robustly outperform advanced single-species metrics, especially for shorter exons (≤240 nt), which are common in animal genomes. Moreover, the two capture largely independent features of protein-coding genes, with different sensitivity/specificity trade-offs, such that their combinations lead to even greater discriminatory power. In addition, we study how discovery power scales with the number and phylogenetic distance of the genomes compared. We find that species at a broad range of distances are comparably effective informants for pairwise comparative gene identification, but that these are surpassed by multi-species comparisons at similar evolutionary divergence. In particular, while pairwise discovery power plateaued at larger distances and never outperformed the most advanced single-species metrics, multi-species comparisons continued to benefit even from the most distant species with no apparent saturation. Last, we find that genes in functional categories typically considered fast-evolving can nonetheless be recovered at very high rates using comparative methods. Our results have implications for comparative genomics analyses in any species, including the human.
Affiliation(s)
- Michael F. Lin
- Broad Institute of MIT and Harvard University, Cambridge, Massachusetts, United States of America
- Ameya N. Deoras
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Matthew D. Rasmussen
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Manolis Kellis
- Broad Institute of MIT and Harvard University, Cambridge, Massachusetts, United States of America
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
20
Abstract
The past twenty years have witnessed an explosion of biological data in diverse database formats governed by heterogeneous infrastructures. Not only are semantics (attribute terms) different in meaning across databases, but their organization varies widely. Ontologies are a concept imported from computing science to describe different conceptual frameworks that guide the collection, organization and publication of biological data. An ontology is similar to a paradigm but has very strict implications for formatting and meaning in a computational context. The use of ontologies is a means of communicating and resolving semantic and organizational differences between biological databases in order to enhance their integration. The purpose of interoperability (or sharing between divergent storage and semantic protocols) is to allow scientists from around the world to share and communicate with each other. This paper describes the rapid accumulation of biological data, its various organizational structures, and the role that ontologies play in interoperability.
Affiliation(s)
- Nadine Schuurman
- Department of Geography, Simon Fraser University RCB 7123, 8888 University Drive, Burnaby, British Columbia, Canada.
21
Krauss V, Thümmler C, Georgi F, Lehmann J, Stadler PF, Eisenhardt C. Near Intron Positions Are Reliable Phylogenetic Markers: An Application to Holometabolous Insects. Mol Biol Evol 2008; 25:821-30. [DOI: 10.1093/molbev/msn013] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.1] [Indexed: 01/08/2023]
22
Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007; 23:2507-17. [PMID: 17720704 DOI: 10.1093/bioinformatics/btm344] [Citation(s) in RCA: 2016] [Impact Index Per Article: 112.0] [Indexed: 12/20/2022]
Abstract
Feature selection techniques have become an apparent need in many bioinformatics applications. In addition to the large pool of techniques that have already been developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques. In this article, we make the interested reader aware of the possibilities of feature selection, providing a basic taxonomy of feature selection techniques, and discussing their use, variety and potential in a number of both common as well as upcoming bioinformatics applications.
Affiliation(s)
- Yvan Saeys
- Department of Plant Systems Biology, VIB, B-9052 Ghent, Belgium.