1
|
Liu M, Hu F, Liu L, Lu X, Li R, Wang J, Wu J, Ma L, Pu Y, Fang Y, Yang G, Wang W, Sun W. Physiological Analysis and Genetic Mapping of Short Hypocotyl Trait in Brassica napus L. Int J Mol Sci 2023; 24:15409. [PMID: 37895090 PMCID: PMC10607371 DOI: 10.3390/ijms242015409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Revised: 10/17/2023] [Accepted: 10/19/2023] [Indexed: 10/29/2023]
Abstract
Hypocotyl length is a botanical trait that affects the cold tolerance of Brassica napus L. (B. napus). In this study, we constructed an F2 segregating population using the cold-resistant short hypocotyl variety '16VHNTS158' and the cold-sensitive long hypocotyl variety 'Tianyou 2288' as the parents, and BSA-seq was employed to identify candidate genes for hypocotyl length in B. napus. The comparison of the parents showed that the average hypocotyl lengths of '16VHNTS158' and 'Tianyou 2288' were 0.41 cm and 0.77 cm, respectively, at the 5-6 leaf stage, and that after different low-temperature treatments '16VHNTS158' exhibited lower relative ion leakage rates than 'Tianyou 2288'. The contents of indole acetic acid (IAA), gibberellin (GA), and brassinosteroid (BR) in the hypocotyls of both varieties increased with decreasing temperature, but the IAA and GA contents of '16VHNTS158' were significantly higher than those of 'Tianyou 2288', and its BR content was lower than that of 'Tianyou 2288'. The genetic analysis indicated that the inheritance of hypocotyl length follows the 2MG-A model. Using SSR molecular markers, a QTL associated with hypocotyl length was identified on chromosome C04. The additive effect value of this locus was 0.025, and it accounted for 2.5% of the phenotypic variation. BSA-seq further localized the major-effect QTL on chromosome C04, associating it with 41 genomic regions. The total length of this region was 1.06 Mb. Within this region, a total of 20 genes with non-synonymous mutations were identified between the parents, and 26 genes with non-synonymous mutations were found within the pooled samples. In the reference genome of B. napus, this region was annotated with 24 candidate genes. These annotated genes are predominantly enriched in four pathways: DNA replication, nucleotide excision repair, plant hormone signal transduction, and mismatch repair. The findings of this study provide a theoretical basis for cloning genes related to hypocotyl length in winter rapeseed and for their utilization in breeding.
Affiliation(s)
- Lijun Liu
- State Key Laboratory of Aridland Crop Science, College of Agronomy, Gansu Agricultural University, Lanzhou 730070, China
- Wancang Sun
- State Key Laboratory of Aridland Crop Science, College of Agronomy, Gansu Agricultural University, Lanzhou 730070, China
|
2
|
Common Functions of Disordered Proteins across Evolutionary Distant Organisms. Int J Mol Sci 2020; 21:2105. [PMID: 32204351 PMCID: PMC7139818 DOI: 10.3390/ijms21062105] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 02/11/2020] [Revised: 03/16/2020] [Accepted: 03/17/2020] [Indexed: 12/14/2022]
Abstract
Intrinsically disordered proteins and regions typically lack a well-defined structure and thus fall outside the scope of the classic sequence–structure–function relationship. Hence, classic sequence- or structure-based bioinformatic approaches are often not well suited to identifying homology or predicting the function of unknown intrinsically disordered proteins. Here, we give selected examples of intrinsic disorder in plant proteins and present how protein function is shared, altered or distinct in evolutionarily distant organisms. Furthermore, we explore how examining the specific role of disorder across different phyla can provide a better understanding of the common features that protein disorder contributes to the respective biological mechanism.
|
3
|
Tang N, Dehury B, Kepp KP. Computing the Pathogenicity of Alzheimer’s Disease Presenilin 1 Mutations. J Chem Inf Model 2019; 59:858-870. [DOI: 10.1021/acs.jcim.8b00896] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 11/28/2022]
Affiliation(s)
- Ning Tang
- Department of Chemistry, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
- Budheswar Dehury
- Department of Chemistry, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
- Kasper P. Kepp
- Department of Chemistry, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
|
4
|
Seifi M, Walter MA. Accurate prediction of functional, structural, and stability changes in PITX2 mutations using in silico bioinformatics algorithms. PLoS One 2018; 13:e0195971. [PMID: 29664915 PMCID: PMC5903617 DOI: 10.1371/journal.pone.0195971] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 02/12/2018] [Accepted: 04/03/2018] [Indexed: 11/24/2022]
Abstract
Mutations in PITX2 have been implicated in several genetic disorders, particularly Axenfeld-Rieger syndrome. In order to determine the most reliable bioinformatics tools for assessing the likely pathogenicity of PITX2 variants, the results of bioinformatics predictions were compared to the impact of variants on PITX2 structure and function. The MutPred, Provean, and PMUT bioinformatic tools were found to have the highest performance in predicting the pathogenicity of all 18 characterized missense variants in PITX2, all with sensitivity and specificity >93%. Applying these three programs to 13 previously uncharacterized PITX2 missense variants predicted 12 of the 13 variants to be deleterious; the exception, A30V, was predicted to be benign by all three programs. Molecular modeling of the PITX2 homeodomain predicts that, of the 31 known PITX2 variants, L54Q, F58L, V83F, V83L, W86C, W86S, and R91P alter PITX2's structure. In contrast, the remaining 24 variants are not predicted to change PITX2's structure. The results of molecular modeling, performed on all the PITX2 missense mutations located in the homeodomain, were compared with the findings of eight protein stability programs. CUPSAT was found to be the most reliable in predicting the effect of missense mutations on PITX2 stability. Our results showed that for PITX2, and likely for other members of this homeodomain transcription factor family, MutPred, Provean, PMUT, molecular modeling, and CUPSAT can reliably be used to predict the pathogenicity of missense variants.
Affiliation(s)
- Morteza Seifi
- Department of Medical Genetics, Faculty of Medicine & Dentistry, University of Alberta, Edmonton, Alberta, Canada
- Michael A. Walter
- Department of Medical Genetics, Faculty of Medicine & Dentistry, University of Alberta, Edmonton, Alberta, Canada
|
5
|
Shen HB, Yi DL, Yao LX, Yang J, Chou KC. Knowledge-based computational intelligence development for predicting protein secondary structures from sequences. Expert Rev Proteomics 2014; 5:653-62. [DOI: 10.1586/14789450.5.5.653] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/25/2022]
|
6
|
Gao L, Cai M, Shen W, Xiao S, Zhou X, Zhang Y. Engineered fungal polyketide biosynthesis in Pichia pastoris: a potential excellent host for polyketide production. Microb Cell Fact 2013; 12:77. [PMID: 24011431 PMCID: PMC3847973 DOI: 10.1186/1475-2859-12-77] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Received: 07/28/2013] [Accepted: 09/04/2013] [Indexed: 12/03/2022]
Abstract
Background Polyketides are one of the most important classes of secondary metabolites and usually make good drugs. Currently, heterologous production of fungal polyketides in a system with high production capacity and pharmaceutical feasibility for industrial application is still in its infancy. Pichia pastoris is a highly successful system for the high-level production of a variety of heterologous proteins. In this work, we aim to develop a P. pastoris-based in vivo fungal polyketide production system for the first time and to evaluate its feasibility for future industrial application. Results A recombinant P. pastoris strain, GS115-NpgA-ATX, carrying the Aspergillus nidulans phosphopantetheinyl transferase (PPTase) gene npgA and the Aspergillus terreus 6-methylsalicylic acid (6-MSA) synthase (6-MSAS) gene atX was constructed. A specific compound was isolated and identified as 6-MSA by HPLC, LC-MS and NMR. Transcription of both genes was detected. In a 5-L bioreactor, GS115-NpgA-ATX grew well and produced 6-MSA quickly, reaching a high value of 2.2 g/L after 20 hours of methanol induction. Thereafter, the cells began to die, which was ascribed to the high concentration of antimicrobial 6-MSA. The distribution of 6-MSA changed over time: during the early and late induction phases it was found mainly in the supernatant, whereas during the intermediate stage it was mainly intracellular. In contrast to the 6-MSA production strain, recombinant M. purpureus pksCT expression strains for citrinin intermediate production did not produce any specific compound, regardless of whether PksCT was located in the cytoplasm or in peroxisomes. However, both npgA and pksCT were transcribed effectively in the cells, and western blot analysis confirmed the expression of the PPTase. The PPTase was then expressed and purified, labeled with fluorescent probes, and reacted with the purified ACP domain of PksCT and its mutant ACPm. Fluorescence was observed only with ACP and not with ACPm, indicating that the PPTase worked well with ACP to convert it into bioactive holo-ACP. Thus, some other factors may affect polyketide synthesis, including the activities of the individual catalytic domains and the release of the product from the PksCT synthase. Conclusions An efficient P. pastoris expression system for fungal polyketides was successfully constructed. It produced 6-MSA at a high level and holds potential for future industrial application to 6-MSA and other fungal polyketides.
Affiliation(s)
- Limei Gao
- State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai 200237, China.
|
7
|
Kaushik S, Mutt E, Chellappan A, Sankaran S, Srinivasan N, Sowdhamini R. Improved detection of remote homologues using cascade PSI-BLAST: influence of neighbouring protein families on sequence coverage. PLoS One 2013; 8:e56449. [PMID: 23437136 PMCID: PMC3577913 DOI: 10.1371/journal.pone.0056449] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 07/16/2012] [Accepted: 01/13/2013] [Indexed: 12/31/2022]
Abstract
Background Development of sensitive sequence search procedures for the detection of distant relationships between proteins at the superfamily/fold level is still a big challenge. The intermediate sequence search approach is the most frequently employed way of identifying remote homologues effectively. In this study, serine proteases of the prolyl oligopeptidase, rhomboid and subtilisin protein families were examined using plant serine proteases from two genomes, A. thaliana and O. sativa, as queries, together with 13 other families of unrelated folds, to identify distant homologues that could not be obtained using PSI-BLAST. Methodology/Principal Findings We propose starting with multiple queries of classical serine protease members to identify remote homologues in families, using a rigorous approach such as Cascade PSI-BLAST. We found that classical sequence-based approaches, such as PSI-BLAST, showed very low sequence coverage in identifying plant serine proteases. The algorithm was applied to an enriched sequence database of homologous domains, and we obtained an overall average coverage of 88% at the family level and 77% at the superfamily or fold level, along with a specificity of ∼100% and a Matthews correlation coefficient of 0.91. A similar approach was also implemented on 13 other protein families representing every structural class in the SCOP database. Further investigation with statistical tests, such as jackknifing, helped us to better understand the influence of neighbouring protein families. Conclusions/Significance Our study suggests that employing multiple queries from a family in Cascade PSI-BLAST searches is useful for predicting distant relationships effectively, even at the superfamily level. We have proposed a generalized strategy to cover all the distant members of a particular family using multiple query sequences. Our findings reveal that prior selection of query sequences and the presence of neighbouring families can be important for covering the search space effectively in minimal computational time. This study also provides an understanding of the ‘bridging’ role of related families.
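The coverage, specificity and Matthews correlation coefficient quoted in this abstract are standard functions of a confusion matrix. The sketch below shows how such figures are usually computed; the function name and the convention of returning 0.0 for an undefined ratio are assumptions made for illustration and are not taken from the paper.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, mcc) from confusion-matrix counts.

    Sensitivity here plays the role of "coverage": the fraction of true
    family members that the search recovers. Undefined ratios (zero
    denominator) are reported as 0.0, which is an illustrative convention.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc
```

The Matthews coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect agreement), which makes it more informative than coverage alone when true homologues are rare relative to the database size.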
Affiliation(s)
- Swati Kaushik
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, Karnataka, India
- Eshita Mutt
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, Karnataka, India
- Ajithavalli Chellappan
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, Karnataka, India
- School of Chemical and Biotechnology, Shanmugha Arts, Science, Technology & Research Academy, Thanjavur, Tamil Nadu, India
- Sandhya Sankaran
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, Bangalore, India
- Narayanaswamy Srinivasan
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, Bangalore, India
- Ramanathan Sowdhamini
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, Karnataka, India
|
8
|
Polyanovsky VO, Roytberg MA, Tumanyan VG. Comparative analysis of the quality of a global algorithm and a local algorithm for alignment of two sequences. Algorithms Mol Biol 2011; 6:25. [PMID: 22032267 PMCID: PMC3223492 DOI: 10.1186/1748-7188-6-25] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Received: 12/31/2010] [Accepted: 10/27/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND Sequence alignment algorithms are key instruments for computer-assisted studies of biopolymers. It is important to take into account the "quality" of the obtained alignments, i.e. how closely the algorithms manage to restore the "gold standard" alignment (GS-alignment), which superimposes positions originating from the same position in the common ancestor of the compared sequences. A 3D-alignment is commonly used as an approximation of the GS-alignment, although this choice is not entirely justified. Among the currently used pairwise alignment algorithms, the best quality is achieved by the optimal alignment algorithm based on affine penalties for deletions (the Smith-Waterman algorithm). Nevertheless, the relative merits of the local and global versions of the algorithm have not been studied. RESULTS Using model series of amino acid sequence pairs, we studied the relative "quality" of results produced by local and global alignments versus (1) the relative length of the similar parts of the sequences (their "cores") and their nonhomologous parts, and (2) the relative positions of the core regions in the compared sequences. We obtained numerical values of the average quality (measured as accuracy and confidence) of the global alignment method and the local alignment method for evolutionary distances between homologous sequence parts from 30 to 240 PAM, for core lengths from 10% to 70% of the total length of the sequences, and for all possible positions of the homologous sequence parts relative to the centers of the sequences. CONCLUSION We identified criteria that specify the conditions under which the local or the global alignment algorithm is preferable, depending on the positions and relative lengths of the cores and nonhomologous parts of the sequences to be aligned. It was demonstrated that when the core part of one sequence was positioned above the core of the other sequence, the global algorithm was more stable at longer evolutionary distances and larger nonhomologous parts than the local algorithm. In contrast, when the cores were positioned asymmetrically, the local algorithm was more stable at longer evolutionary distances and larger nonhomologous parts than the global algorithm. This opens the possibility of creating a combined method that generates more accurate alignments.
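The difference between the global and local versions of the dynamic-programming alignment discussed in this abstract can be seen in a few lines of code. The sketch below is a simplified illustration only: it uses a linear gap penalty instead of the affine penalties the abstract mentions, it returns the optimal score rather than the alignment itself, and the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions rather than parameters from the paper.

```python
def align_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Optimal pairwise alignment score with a linear gap penalty.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of substrings.
    """
    # First row of the DP matrix: leading gaps cost nothing in local mode.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                        prev[j] + gap,      # gap in b
                        cur[j - 1] + gap)   # gap in a
            if local:
                score = max(score, 0)       # a local alignment may restart
            cur.append(score)
        if local:
            best = max(best, max(cur))
        prev = cur
    # Global score sits in the last cell; local score is the matrix maximum.
    return best if local else prev[-1]
```

With a short core flanked by unrelated residues, for example `align_score("XXACGTXX", "ACGT")`, the global score is pulled down by the gaps needed to span the flanks, while `local=True` scores only the shared core. This is the kind of core-versus-flank trade-off the abstract studies.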
Affiliation(s)
- Mikhail A Roytberg
- Institute of Mathematical Problems in Biology, RAS, 142290, Pushchino, Russia
|
9
|
Yakovlev VV, Roytberg MA. Increasing the accuracy of global alignment of amino acid sequences by constructing a set of alignment candidates. Biophysics (Nagoya-shi) 2010. [DOI: 10.1134/s0006350910060011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
|
10
|
Lobanov MY, Finkel’shtein AV. Analogy-based protein structure prediction: III. Optimizing the combination of the substitution matrix and pseudopotentials used to align protein sequences with spatial structures. Mol Biol 2010. [DOI: 10.1134/s0026893310010140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
|
11
|
Lobanov MY, Finkel’shtein AV. Analogy-based protein structure prediction: II. Testing of substitution matrices and pseudopotentials used to align protein sequences with spatial structures. Mol Biol 2009. [DOI: 10.1134/s0026893309040207] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/23/2022]
|
12
|
Naumoff DG, Carreras M. PSI protein classifier: A new program automating PSI-BLAST search results. Mol Biol 2009. [DOI: 10.1134/s0026893309040189] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/23/2022]
|
13
|
Skrabanek L, Niv MY. Scan2S: increasing the precision of PROSITE pattern motifs using secondary structure constraints. Proteins 2009; 72:1138-47. [PMID: 18320586 DOI: 10.1002/prot.22008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
Abstract
Sequence signature databases such as PROSITE, which include protein pattern motifs indicative of a protein's function, are widely used for function prediction studies, cellular localization annotation, and sequence classification. Correct annotation relies on high precision of the motifs. We present a new and general approach for increasing the precision of established protein pattern motifs by including secondary structure constraints (SSCs). We use Scan2S, the first sequence motif-scanning program to optionally include SSCs, to augment PROSITE pattern motifs. The constraints were derived from either the DSSP secondary structure assignment or the PSIPRED predictions for PROSITE-documented true positive hits. The secondary structure-augmented motifs were scanned against all SwissProt sequences, for which secondary structure predictions were precalculated. Against this dataset, motifs with PSIPRED-derived SSCs exhibited improved performance over motifs with DSSP-derived constraints. The precision of 763 of the 782 PSIPRED-augmented motifs remained unchanged or increased compared to the original motifs; 26 motifs showed an absolute precision increase of 10-30%. We provide the complete set of augmented motifs and the Scan2S program at http://physiology.med.cornell.edu/go/scan2s. Our results suggest a general protocol for increasing the precision of protein pattern detection via the inclusion of SSCs.
Affiliation(s)
- Lucy Skrabanek
- Department of Physiology and Biophysics, Weill Medical College of Cornell University, New York, New York 10021, USA.
|
14
|
Sacan A, Toroslu IH, Ferhatosmanoglu H. Integrated search and alignment of protein structures. Bioinformatics 2008; 24:2872-9. [PMID: 18945684 DOI: 10.1093/bioinformatics/btn545] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Indexed: 11/12/2022]
Abstract
MOTIVATION Identification and comparison of similar three-dimensional (3D) protein structures has become an even greater challenge in the face of rapidly growing structure databases. Here, we introduce Vorometric, a new method that provides efficient search and alignment of a query protein against a database of protein structures. Voronoi contacts of the protein residues are enriched with secondary structure information, and a metric substitution matrix is developed to allow efficient indexing. The contact hits obtained from a distance-based indexing method are extended to obtain high-scoring segment pairs, which are then used to generate structural alignments. RESULTS Vorometric is the first method to address both the search and the alignment problems in protein structure databases. The experimental results show that Vorometric is simultaneously effective in retrieving similar protein structures, producing high-quality structure alignments, and identifying cross-fold similarities. Vorometric outperforms current structure retrieval methods in search accuracy, while requiring comparable running times. Furthermore, the structural superpositions produced are shown to have better quality and coverage than those of popular structure alignment tools. AVAILABILITY Vorometric is available as a web service at http://bio.cse.ohio-state.edu/Vorometric
Affiliation(s)
- Ahmet Sacan
- Department of Computer Engineering, Middle East Technical University, Ankara, Turkey.
|
15
|
Polyanovsky V, Roytberg MA, Tumanyan VG. Reconstruction of genuine pair-wise sequence alignment. J Comput Biol 2008; 15:379-91. [PMID: 18435572 DOI: 10.1089/cmb.2007.0145] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022]
Abstract
In many applications, the algorithmically obtained alignment ideally should restore the "golden standard" (GS) alignment, which superimposes positions originating from the same position of the common ancestor of the compared sequences. The average similarity between the algorithmically obtained and GS alignments ("the quality") is an important characteristic of an alignment algorithm. We proposed to determine the quality of an algorithm using sequences that were artificially generated in accordance with an appropriate evolution model; the approach was applied to the global version of the Smith-Waterman algorithm (SWA). The quality of SWA is between 97% (for a PAM distance of 60) and 70% (for a PAM distance of 300). The percentage of identical aligned residues is the same for algorithmic and GS alignments. The total length of indels in algorithmic alignments is less than in the GS alignments, mainly due to a substantial decrease in the number of indels in algorithmic alignments.
Affiliation(s)
- Valery Polyanovsky
- Engelhardt Institute of Molecular Biology, Russian Academy of Science (RAS), Moscow, Russia
|
16
|
Aydin Z, Altunbasak Y, Pakatci IK, Erdogan H. Training set reduction methods for protein secondary structure prediction in single-sequence condition. ACTA ACUST UNITED AC 2008; 2007:5025-8. [PMID: 18003135 DOI: 10.1109/iembs.2007.4353469] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/09/2022]
Abstract
Orphan proteins are characterized by the lack of significant sequence similarity to database proteins. To infer the functional properties of the orphans, more elaborate techniques that utilize structural information are required. In this regard, the protein structure prediction gains considerable importance. Secondary structure prediction algorithms designed for orphan proteins (also known as single-sequence algorithms) cannot utilize multiple alignments or alignment profiles, which are derived from similar proteins. This is a limiting factor for the prediction accuracy. One way to improve the performance of a single-sequence algorithm is to perform re-training. In this approach, first, the models used by the algorithm are trained by a representative set of proteins and a secondary structure prediction is computed. Then, using a distance measure, the original training set is refined by removing proteins that are dissimilar to the given protein. This step is followed by the re-estimation of the model parameters and the prediction of the secondary structure. In this paper, we compare training set reduction methods that are used to re-train the hidden semi-Markov models employed by the IPSSP algorithm [1]. We found that the composition based reduction method has the highest performance compared to the alignment based and the Chou-Fasman based reduction methods. In addition, threshold-based reduction performed better than the reduction technique that selects the first 80% of the dataset proteins.
Affiliation(s)
- Zafer Aydin
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250, USA.
|
17
|
Otaki JM, Gotoh T, Yamamoto H. Potential implications of availability of short amino acid sequences in proteins: an old and new approach to protein decoding and design. Biotechnology Annual Review 2008; 14:109-41. [PMID: 18606361 DOI: 10.1016/s1387-2656(08)00004-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 12/13/2022]
Abstract
The three-dimensional structure of a protein molecule is primarily determined by its amino acid sequence, and thus the elucidation of general rules embedded in amino acid sequences is of great importance in protein science and engineering. To extract valuable information from sequences, we propose an analytical method in which a protein sequence is considered to be constructed by serial superimpositions of short amino acid sequences of n amino acid sets, especially triplets (3-aa sets). Using a comprehensive nonredundant protein database, we first examined the "availability" of all possible combinatorial sets of 8,000 triplet species. The availability score was mathematically defined as an indicator of the relative "preference" or "avoidance" for a given short constituent sequence to be used in a protein chain. Availability scores of real proteins were clearly biased against those of randomly generated proteins. We found many triplet species that occurred in the database more than expected or less than expected. Such bias extended to longer sets, and we found that some species of pentats (5-aa sets) that occurred reasonably frequently in the randomly generated protein population did not occur at all in any real proteins known today. The availability score was dependent on species, potentially serving as a phylogenetic indicator. Furthermore, we suggest possibilities for various biotechnological applications of characteristic short sequences, such as human-specific and pathogen-specific short sequences obtained from availability analysis. The availability score was also dependent on secondary structures, potentially serving as a structural indicator. Availability analysis on triplets may be combined with a comprehensive data collection on the φ and ψ peptide-bond angles of the amino acid at the center of each triplet, i.e., a collection of Ramachandran plots for each triplet. These triplet characters, together with other physicochemical data, will provide us with basic information on the relationship between protein sequence and structure, by which structure prediction and engineering may be greatly facilitated. Availability analysis may also be useful in identifying word processing units in amino acid sequences based on an analogy to natural languages. Together with other approaches, availability analysis will elucidate general rules hidden in the primary sequences and eventually contribute to rebuilding the paradigm of protein science.
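The abstract does not give the formula for its availability score, so the sketch below should not be read as the authors' definition. It only illustrates the general idea of comparing how often each triplet is observed with how often it would be expected if residues were drawn independently at their background frequencies; the function name and the observed/expected ratio are assumptions chosen for illustration.

```python
from collections import Counter


def triplet_enrichment(sequences):
    """Observed/expected ratio for every triplet seen in the sequences.

    Expected counts assume each position is drawn independently from the
    pooled single-residue frequencies. This is a simple stand-in, not the
    availability score defined in the paper.
    """
    residue_counts = Counter()
    triplet_counts = Counter()
    for seq in sequences:
        residue_counts.update(seq)
        # Overlapping windows of length 3 (none for sequences shorter than 3).
        triplet_counts.update(seq[i:i + 3] for i in range(len(seq) - 2))
    total_residues = sum(residue_counts.values())
    total_triplets = sum(triplet_counts.values())
    ratios = {}
    for triplet, observed in triplet_counts.items():
        expected = total_triplets
        for residue in triplet:
            expected *= residue_counts[residue] / total_residues
        ratios[triplet] = observed / expected
    return ratios
```

Ratios above 1 mark triplets that occur more often than the background composition predicts, and ratios below 1 mark those that occur less often. Triplets that never occur are absent from the result, so detecting fully avoided triplets would require enumerating all 8,000 combinations separately.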
Affiliation(s)
- Joji M Otaki
- Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa 903-0213, Japan.
|
18
|
Birzele F, Gewehr JE, Csaba G, Zimmer R. Vorolign--fast structural alignment using Voronoi contacts. Bioinformatics 2007; 23:e205-11. [PMID: 17237093 DOI: 10.1093/bioinformatics/btl294] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.9] [Indexed: 11/15/2022]
Abstract
Vorolign, a fast and flexible structural alignment method for two or more protein structures, is introduced. The method aligns protein structures using double dynamic programming and measures the similarity of two residues based on the evolutionary conservation of their corresponding Voronoi-contacts in the protein structure. This similarity function allows aligning protein structures even in cases where structural flexibilities exist. Multiple structural alignments are generated from a set of pairwise alignments using a consistency-based, progressive multiple alignment strategy. RESULTS The performance of Vorolign is evaluated for different applications of protein structure comparison, including automatic family detection as well as pairwise and multiple structure alignment. Vorolign accurately detects the correct family, superfamily or fold of a protein with respect to the SCOP classification on a set of difficult target structures. A scan against a database of >4000 proteins takes on average 1 min per target. The performance of Vorolign in calculating pairwise and multiple alignments is found to be comparable with other pairwise and multiple protein structure alignment methods. AVAILABILITY Vorolign is freely available for academic users as a web server at http://www.bio.ifi.lmu.de/Vorolign
Affiliation(s)
- Fabian Birzele
- Practical Informatics and Bioinformatics Group, Department of Informatics, Ludwig-Maximilians-University, Munich, Germany.
|
19
|
Ma B, Wu L, Zhang K. Improving the sensitivity and specificity of protein homology search by incorporating predicted secondary structures. J Bioinform Comput Biol 2006; 4:709-20. [PMID: 16960971 DOI: 10.1142/s0219720006002119] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/30/2005] [Revised: 12/30/2005] [Accepted: 12/30/2005] [Indexed: 11/18/2022]
Abstract
In this paper, we improve homology search performance by combining predicted protein secondary structures with protein sequences. Previous research suggested that a straightforward combination with predicted secondary structures did not improve homology search performance, mostly because of errors in the structure prediction. We solved this problem by taking into account the confidence scores output by the prediction programs.
Affiliation(s)
- Bin Ma
- Computer Science Department, University of Western Ontario, London, ON N6A 5B7, Canada.
|
20
|
Litvinov II, Lobanov MY, Mironov AA, Finkelshtein AV, Roytberg MA. Information on the secondary structure improves the quality of protein sequence alignment. Mol Biol 2006. [DOI: 10.1134/s0026893306030149] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/23/2022]
|
21
|
Nozaki Y, Bellgard M. Statistical evaluation and comparison of a pairwise alignment algorithm that a priori assigns the number of gaps rather than employing gap penalties. Bioinformatics 2004; 21:1421-8. [PMID: 15591359 DOI: 10.1093/bioinformatics/bti198] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: Although pairwise sequence alignment is essential in comparative genomic sequence analysis, it has proven difficult to precisely determine the gap penalties for a given pair of sequences. A common practice is to employ default penalty values. However, there are a number of problems associated with using gap penalties. First, alignment results can vary depending on the gap penalties, making it difficult to explore appropriate parameters. Second, the statistical significance of an alignment score is typically based on a theoretical model of non-gapped alignments, which may be misleading. Finally, there is no way to control the number of gaps for a given pair of sequences, even if the number of gaps is known in advance. RESULTS: In this paper, we develop and evaluate the performance of an alignment technique that allows the researcher to assign a priori the number of allowable gaps, rather than using gap penalties. We compare this approach with the Smith-Waterman and Needleman-Wunsch techniques on a set of structurally aligned protein sequences. We demonstrate that this approach outperforms the other techniques, especially for short sequences (56-133 residues) with low similarity (<25%). Further, by employing a statistical measure, we show that it can be used to assess the quality of the alignment in relation to the true alignment with the associated optimal number of gaps. AVAILABILITY: The implementation of the described methods, SANK_AL, is available at http://cbbc.murdoch.edu.au/ CONTACT: matthew@cbbc.murdoch.edu.au
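The idea of fixing the number of gaps in advance, instead of penalizing them, can be illustrated with a small dynamic program. The following is only a minimal sketch and not the SANK_AL implementation: it assumes a global alignment, a simple match/mismatch score, and defines a "gap" as one contiguous run of insertions or deletions. The function name and the scoring values are hypothetical.

```python
import math

def best_score_with_gap_budget(a, b, max_gaps, match=1, mismatch=-1):
    """Best global alignment score of a and b using at most max_gaps gaps.

    Gaps carry no penalty; only their number is limited.
    Returns None when no alignment fits within the budget.
    """
    n, m = len(a), len(b)
    neg = -math.inf
    # best[g][i][j][s]: best score for a[:i] vs b[:j] with exactly g gaps opened,
    # where the last column is s = 0 (residue pair), 1 (a[i-1] opposite a gap),
    # or 2 (b[j-1] opposite a gap).
    best = [[[[neg] * 3 for _ in range(m + 1)] for _ in range(n + 1)]
            for _ in range(max_gaps + 1)]
    best[0][0][0][0] = 0  # empty alignment, no gaps used

    for g in range(max_gaps + 1):
        for i in range(n + 1):
            for j in range(m + 1):
                cell = best[g][i][j]
                if i > 0 and j > 0:
                    # Align a[i-1] with b[j-1]; the gap count is unchanged.
                    pair = match if a[i - 1] == b[j - 1] else mismatch
                    cell[0] = max(best[g][i - 1][j - 1]) + pair
                if i > 0:
                    # Extend a gap in b, or open a new one (uses one unit of budget).
                    extend = best[g][i - 1][j][1]
                    opened = neg
                    if g > 0:
                        prev = best[g - 1][i - 1][j]
                        opened = max(prev[0], prev[2])
                    cell[1] = max(extend, opened)
                if j > 0:
                    # Extend a gap in a, or open a new one.
                    extend = best[g][i][j - 1][2]
                    opened = neg
                    if g > 0:
                        prev = best[g - 1][i][j - 1]
                        opened = max(prev[0], prev[1])
                    cell[2] = max(extend, opened)

    result = max(max(best[g][n][m]) for g in range(max_gaps + 1))
    return None if result == neg else result
```

With a budget of zero, sequences of different lengths cannot be aligned globally at all, so the function returns None; raising the budget lets the caller see how the best score changes as more gaps are allowed.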
Affiliation(s)
- Yasuyuki Nozaki
- Centre for Bioinformatics and Biological Computing, Murdoch University, Murdoch, WA 6150, Australia
|
22
|
Przybylski D, Rost B. Improving Fold Recognition Without Folds. J Mol Biol 2004; 341:255-69. [PMID: 15312777 DOI: 10.1016/j.jmb.2004.05.041] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.1] [Received: 02/03/2004] [Revised: 05/18/2004] [Accepted: 05/18/2004] [Indexed: 11/21/2022]
Abstract
The most reliable way to align two proteins of unknown structure is through sequence-profile and profile-profile alignment methods. If the structure for one of the two is known, fold recognition methods outperform purely sequence-based alignments. Here, we introduced a novel method that aligns generalised sequence and predicted structure profiles. Using predicted 1D structure (secondary structure and solvent accessibility) significantly improved over sequence-only methods, both in terms of correctly recognising pairs of proteins with different sequences and similar structures and in terms of correctly aligning the pairs. The scores obtained by our generalised scoring matrix followed an extreme value distribution; this yielded accurate estimates of the statistical significance of our alignments. We found that mistakes in 1D structure predictions correlated between proteins from different sequence-structure families. The impact of this surprising result was that our method succeeded in significantly out-performing sequence-only methods even without explicitly using structural information from any of the two. Since AGAPE also outperformed established methods that rely on 3D information, we made it available through. If we solved the problem of CPU-time required to apply AGAPE on millions of proteins, our results could also impact everyday database searches.
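When alignment scores follow an extreme value (Gumbel) distribution, the significance of a score is the upper-tail probability of that distribution. The sketch below shows that calculation only; the location and scale parameters are hypothetical inputs that would have to be fitted to background scores, and nothing here reproduces how AGAPE estimates them.

```python
import math

def gumbel_tail_probability(score, location, scale):
    """P(S >= score) for a Gumbel (type I extreme value) distribution.

    location and scale are assumed to be fitted elsewhere; scale must be positive.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    z = (score - location) / scale
    if z < -30:
        # exp(-z) is so large that the tail probability is 1.0 in double precision.
        return 1.0
    # 1 - exp(-exp(-z)), written with expm1 for accuracy when the result is tiny.
    return -math.expm1(-math.exp(-z))
```

For example, a score equal to the location parameter has a tail probability of 1 - 1/e, about 0.632, and the probability falls off roughly exponentially as the score rises above it.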
Affiliation(s)
- Dariusz Przybylski
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
|
23
|
Udwary DW, Merski M, Townsend CA. A method for prediction of the locations of linker regions within large multifunctional proteins, and application to a type I polyketide synthase. J Mol Biol 2002; 323:585-98. [PMID: 12381311 PMCID: PMC3400148 DOI: 10.1016/s0022-2836(02)00972-5] [Citation(s) in RCA: 96] [Impact Index Per Article: 4.4] [Indexed: 11/25/2022]
Abstract
Multifunctional proteins often appear to result from fusion of smaller proteins and in such cases typically can be separated into their ancestral components simply by cleaving the linker regions that separate the domains. Though possibly guided by sequence alignment, structural evidence, or light proteolysis, determination of the locations of linker regions remains empirical. We have developed an algorithm, named UMA, to predict the locations of linker regions in multifunctional proteins by quantification of the conservation of several properties within protein families, and the results agree well with structurally characterized proteins. This technique has been applied to a family of fungal type I iterative polyketide synthases (PKS), allowing prediction of the locations of all of the standard PKS domains, as well as two previously unidentified domains. Using these predictions, we report the cloning of the first fragment from the PKS norsolorinic acid synthase, responsible for biosynthesis of the first isolatable intermediate in aflatoxin production. The expression, light proteolysis and catalytic abilities of this acyl carrier protein-thioesterase didomain are discussed.
|
24
|
Schäffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 2001; 29:2994-3005. [PMID: 11452024 PMCID: PMC55814 DOI: 10.1093/nar/29.14.2994] [Citation(s) in RCA: 951] [Impact Index Per Article: 41.3] [Received: 04/23/2001] [Revised: 05/30/2001] [Accepted: 05/30/2001] [Indexed: 11/13/2022] Open
Abstract
PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 +/- 0.005 to 0.895 +/- 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
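A normalized ROC score of the kind mentioned above can be computed from a ranked hit list. The sketch below uses the common truncated ROC_n definition as an assumption: for each of the first n false positives, count the true positives ranked above it, then divide the sum by n times the total number of true positives. The abstract does not state the cutoff n or how short lists are handled, so both the parameter and the padding rule in the code are illustrative choices.

```python
def roc_n(ranked_is_true, n, total_true=None):
    """Truncated, normalized ROC score in [0, 1] for a ranked hit list.

    ranked_is_true: booleans ordered from best to worst hit (True = true positive).
    n: number of false positives to integrate over (an assumed parameter).
    total_true: number of true positives overall; defaults to those in the list.
    """
    if total_true is None:
        total_true = sum(1 for flag in ranked_is_true if flag)
    if n <= 0 or total_true <= 0:
        raise ValueError("n and total_true must be positive")

    true_seen = 0
    false_seen = 0
    area = 0
    for flag in ranked_is_true:
        if flag:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true positives ranked above this false positive
            if false_seen == n:
                break

    # Assumption: if the list holds fewer than n false positives, the missing
    # ones are treated as ranking below every hit in the list.
    area += (n - false_seen) * true_seen
    return area / (n * total_true)
```

A ranking that places every true positive ahead of the first false positive scores 1, and one that places n false positives ahead of every true positive scores 0.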
Affiliation(s)
- A A Schäffer
- National Center for Biotechnology Information, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.
|
25
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2001. [PMCID: PMC2447213 DOI: 10.1002/cfg.58] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022] Open
|