101
Abstract
Domains are considered as the basic units of protein folding, evolution, and function. Decomposing each protein into modular domains is thus a basic prerequisite for accurate functional classification of biological molecules. Here, we present ADDA, an automatic algorithm for domain decomposition and clustering of all protein domain families. We use alignments derived from an all-on-all sequence comparison to define domains within protein sequences based on a global maximum likelihood model. In all, 90% of domain boundaries are predicted within 10% of domain size when compared with the manual domain definitions given in the SCOP database. A representative database of 249,264 protein sequences was decomposed into 450,462 domains. These domains were clustered on the basis of sequence similarities into 33,879 domain families containing at least two members with less than 40% sequence identity. Validation against family definitions in the manually curated databases SCOP and PFAM indicates almost perfect unification of various large domain families while contamination by unrelated sequences remains at a low level. The global survey of protein-domain space by ADDA confirms that most large and universal domain families are already described in PFAM and/or SMART. However, a survey of the complete set of mobile modules leads to the identification of 1479 new interesting domain families which shuffle around in multi-domain proteins. The data are publicly available at ftp://ftp.ebi.ac.uk/pub/contrib/heger/adda.
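The boundary criterion quoted in the abstract above (a predicted boundary counts as correct when it lies within 10% of the domain size of the SCOP boundary) can be sketched as follows. This is a minimal illustration, not ADDA's evaluation code: the function names, the tuple layout, and the reading of "within 10% of domain size" as an absolute residue offset of at most 0.1 × domain length are all assumptions.

```python
from typing import Iterable, Tuple

# (predicted boundary, reference boundary, reference domain length), in residues.
# This tuple layout is hypothetical and chosen only for the illustration.
BoundaryCase = Tuple[int, int, int]


def boundary_within_tolerance(predicted: int, reference: int,
                              domain_length: int, tolerance: float = 0.10) -> bool:
    """True if the predicted boundary is within tolerance * domain_length residues."""
    return abs(predicted - reference) <= tolerance * domain_length


def fraction_within_tolerance(cases: Iterable[BoundaryCase],
                              tolerance: float = 0.10) -> float:
    """Fraction of boundaries that satisfy the tolerance criterion."""
    cases = list(cases)
    if not cases:
        raise ValueError("at least one boundary case is required")
    hits = sum(boundary_within_tolerance(p, r, n, tolerance) for p, r, n in cases)
    return hits / len(cases)


# Made-up numbers, not data from the paper.
example_cases = [(105, 100, 150), (130, 100, 150)]
example_fraction = fraction_within_tolerance(example_cases)
```

With the made-up cases above, the first boundary is off by 5 residues (allowed: 15) and the second by 30, so half of the boundaries pass.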
Affiliation(s)
- Andreas Heger
- EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
102
Mooney SD, Klein TE, Altman RB, Trifiro MA, Gottlieb B. A functional analysis of disease-associated mutations in the androgen receptor gene. Nucleic Acids Res 2003; 31:e42. [PMID: 12682377 PMCID: PMC153754 DOI: 10.1093/nar/gng042] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Abstract
Mutations in the androgen receptor (AR) are associated with a variety of diseases including androgen insensitivity syndrome and prostate cancer, but the way in which these mutations cause disease is poorly understood. We present a method for distinguishing likely disease-causing mutations from mutations that are merely associated with disease but have no causal role. Our method uses a measure of nucleotide conservation, and we find that conservation often correlates with severity of the clinical phenotype. Further, by only including mutations whose pathogenicity has been proven experimentally, this correlation is enhanced in the case of prostate cancer-associated mutations. Our method provides a means for assessing the significance of single nucleotide polymorphisms (SNPs) and cancer-associated mutations.
Affiliation(s)
- Sean D Mooney
- Stanford Medical Informatics, Department of Genetics, Stanford University, MSOB X-215, 251 Campus Drive, Stanford, CA 94305-5479, USA.
103
Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003. [PMID: 12734009 DOI: 10.1186/gb-2003-4-5-p3] [Citation(s) in RCA: 5954] [Impact Index Per Article: 270.6] [Indexed: 11/10/2022]
Abstract
BACKGROUND Functional annotation of differentially expressed genes is a necessary and critical step in the analysis of microarray data. The distributed nature of biological knowledge frequently requires researchers to navigate through numerous web-accessible databases gathering information one gene at a time. A more judicious approach is to provide query-based access to an integrated database that disseminates biologically rich information across large datasets and displays graphic summaries of functional information. RESULTS Database for Annotation, Visualization, and Integrated Discovery (DAVID; http://www.david.niaid.nih.gov) addresses this need via four web-based analysis modules: 1) Annotation Tool - rapidly appends descriptive data from several public databases to lists of genes; 2) GoCharts - assigns genes to Gene Ontology functional categories based on user selected classifications and term specificity level; 3) KeggCharts - assigns genes to KEGG metabolic processes and enables users to view genes in the context of biochemical pathway maps; and 4) DomainCharts - groups genes according to PFAM conserved protein domains. CONCLUSIONS Analysis results and graphical displays remain dynamically linked to primary data and external data repositories, thereby furnishing in-depth as well as broad-based data coverage. The functionality provided by DAVID accelerates the analysis of genome-scale datasets by facilitating the transition from data collection to biological meaning.
Affiliation(s)
- Glynn Dennis
- Science Applications International Corporation-Frederick, Clinical Services Program, Laboratory of Immunopathogenesis and Bioinformatics, National Cancer Institute at Frederick, MD 21702, USA.
104
Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003. [PMID: 12734009 DOI: 10.1186/gb-2003-4-9-r60] [Citation(s) in RCA: 1267] [Impact Index Per Article: 57.6] [Indexed: 01/11/2023]
Abstract
BACKGROUND Functional annotation of differentially expressed genes is a necessary and critical step in the analysis of microarray data. The distributed nature of biological knowledge frequently requires researchers to navigate through numerous web-accessible databases gathering information one gene at a time. A more judicious approach is to provide query-based access to an integrated database that disseminates biologically rich information across large datasets and displays graphic summaries of functional information. RESULTS Database for Annotation, Visualization, and Integrated Discovery (DAVID; http://www.david.niaid.nih.gov) addresses this need via four web-based analysis modules: 1) Annotation Tool - rapidly appends descriptive data from several public databases to lists of genes; 2) GoCharts - assigns genes to Gene Ontology functional categories based on user selected classifications and term specificity level; 3) KeggCharts - assigns genes to KEGG metabolic processes and enables users to view genes in the context of biochemical pathway maps; and 4) DomainCharts - groups genes according to PFAM conserved protein domains. CONCLUSIONS Analysis results and graphical displays remain dynamically linked to primary data and external data repositories, thereby furnishing in-depth as well as broad-based data coverage. The functionality provided by DAVID accelerates the analysis of genome-scale datasets by facilitating the transition from data collection to biological meaning.
Affiliation(s)
- Glynn Dennis
- Science Applications International Corporation-Frederick, Clinical Services Program, Laboratory of Immunopathogenesis and Bioinformatics, National Cancer Institute at Frederick, MD 21702, USA.
105
Abstract
The rapid growth of bio-sequence information has resulted in an increasing demand for reliable methods that group proteins. A few databases with curated alignments of protein families have demonstrated that expert-driven repositories can keep up with the data deluge in the genome era. These original resources implicitly identify domain-like modules in proteins. An increasing number of automatic methods have sprouted over the past few years that cluster the protein universe. Many of these implicitly dissect proteins into structural domain-like fragments. In a very coarse-grained evaluation, some of the automatic methods appear to be on par with expert-driven approaches. However, neither automatic nor manual methods are currently entirely up to the challenges of tasks such as target selection in structural genomics. Thus, we urgently need refined and sustained automatic clustering tools.
Affiliation(s)
- Jinfeng Liu
- CUBIC and North East Structural Genomics Consortium, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
106
Huang H, Barker WC, Chen Y, Wu CH. iProClass: an integrated database of protein family, function and structure information. Nucleic Acids Res 2003; 31:390-2. [PMID: 12520030 PMCID: PMC165491 DOI: 10.1093/nar/gkg044] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.3] [Indexed: 11/13/2022]
Abstract
The iProClass database provides comprehensive, value-added descriptions of proteins and serves as a framework for data integration in a distributed networking environment. The protein information in iProClass includes family relationships as well as structural and functional classifications and features. The current version consists of about 830 000 non-redundant PIR-PSD, SWISS-PROT, and TrEMBL proteins organized with more than 36 000 PIR superfamilies, 145 000 families, 4000 domains, 1300 motifs and 550 000 FASTA similarity clusters. It provides rich links to over 50 databases of protein sequences, families, functions and pathways, protein-protein interactions, post-translational modifications, protein expressions, structures and structural classifications, genes and genomes, ontologies, literature and taxonomy. Protein and superfamily summary reports present extensive annotation information and include membership statistics and graphical display of domains and motifs. iProClass employs an open and modular architecture for interoperability and scalability. It is implemented in the Oracle object-relational database system and is updated biweekly. The database is freely accessible from the web site at http://pir.georgetown.edu/iproclass/ and searchable by sequence or text string. The data integration in iProClass supports exploration of protein relationships. Such knowledge is fundamental to the understanding of protein evolution, structure and function and crucial to functional genomic and proteomic research.
Affiliation(s)
- Hongzhan Huang
- Department of Biochemistry and Molecular Biology, Georgetown University Medical Center, 3900 Reservoir Road, NW, Box 571414, Washington, DC 20057-1414, USA
107
Vlahovicek K, Kaján L, Murvai J, Hegedus Z, Pongor S. The SBASE domain sequence library, release 10: domain architecture prediction. Nucleic Acids Res 2003; 31:403-5. [PMID: 12520034 PMCID: PMC165545 DOI: 10.1093/nar/gkg098] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022]
Abstract
SBASE (http://www.icgeb.trieste.it/sbase) is an on-line collection of protein domain sequences and related computational tools designed to facilitate detection of domain homologies based on simple database search. The 10th 'jubilee release' of the SBASE library of protein domain sequences contains 1 052 904 protein sequence segments annotated by structure, function, ligand-binding or cellular topology, clustered into over 6000 domain groups. Domain identification and functional prediction are based on a comparison of BLAST search outputs with a knowledge base of biologically significant similarities extracted from known domain groups. The knowledge base is generated automatically for each domain group from the comparison of within-group ('self') and out-of-group ('non-self') similarities. This is a memory-based approach wherein group-specific similarity functions are automatically learned from the database.
Affiliation(s)
- Kristian Vlahovicek
- ICGEB-International Center for Genetic Engineering and Biotechnology, Area Science Park, 34012 Trieste, Italy
108
Li WW, Quinn GB, Alexandrov NN, Bourne PE, Shindyalov IN. A comparative proteomics resource: proteins of Arabidopsis thaliana. Genome Biol 2003; 4:R51. [PMID: 12914659 PMCID: PMC193643 DOI: 10.1186/gb-2003-4-8-r51] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.0] [Received: 02/03/2003] [Revised: 05/06/2003] [Accepted: 07/02/2003] [Indexed: 11/11/2022]
Abstract
Using an integrative genome annotation pipeline (iGAP) for proteome-wide protein structure and functional domain assignment, we analyzed all the proteins of Arabidopsis thaliana. Three-dimensional structures at the level of the domain are assigned by fold recognition and threading based on a novel fold library that extends common domain classifications. iGAP is being applied to proteins from all available proteomes as part of a comparative proteomics resource. The database is accessible from the web.
Affiliation(s)
- Wilfred W Li
- San Diego Supercomputer Center, 9500 Gilman Drive, University of California San Diego, La Jolla, CA 92093-0505, USA.
109
DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003. [PMCID: PMC193660] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
The distributed nature of biological knowledge poses a major challenge to the interpretation of genome-scale datasets, including those derived from microarray and proteomic studies. This report describes DAVID, a web-accessible program that integrates functional genomic annotations with intuitive graphical summaries. Lists of gene or protein identifiers are rapidly annotated and summarized according to shared categorical data for Gene Ontology, protein domain, and biochemical pathway membership. DAVID assists in the interpretation of genome-scale datasets by facilitating the transition from data collection to biological meaning.
110
Overbeek R, Larsen N, Walunas T, D'Souza M, Pusch G, Selkov E, Liolios K, Joukov V, Kaznadzey D, Anderson I, Bhattacharyya A, Burd H, Gardner W, Hanke P, Kapatral V, Mikhailova N, Vasieva O, Osterman A, Vonstein V, Fonstein M, Ivanova N, Kyrpides N. The ERGO genome analysis and discovery system. Nucleic Acids Res 2003; 31:164-71. [PMID: 12519973 PMCID: PMC165577 DOI: 10.1093/nar/gkg148] [Citation(s) in RCA: 170] [Impact Index Per Article: 7.7] [Indexed: 11/14/2022]
Abstract
The ERGO (http://ergo.integratedgenomics.com/ERGO/) genome analysis and discovery suite is an integration of biological data from genomics, biochemistry, high-throughput expression profiling, genetics and peer-reviewed journals to achieve a comprehensive analysis of genes and genomes. Far beyond any conventional systems that facilitate functional assignments, ERGO combines pattern-based analysis with comparative genomics by visualizing genes within the context of regulation, expression profiling, phylogenetic clusters, fusion events, networked cellular pathways and chromosomal neighborhoods of other functionally related genes. The result of this multifaceted approach is to provide an extensively curated database of the largest available integration of genomes, with a vast collection of reconstructed cellular pathways spanning all domains of life. Although access to ERGO is provided only under subscription, it is already widely used by the academic community. The current version of the system integrates 500 genomes from all domains of life in various levels of completion, 403 of which are available for subscription.
Affiliation(s)
- Ross Overbeek
- Integrated Genomics Inc., 2201 West Campbell Park Drive, Chicago, IL 60612, USA.
111
Ikeda M, Arai M, Okuno T, Shimizu T. TMPDB: a database of experimentally-characterized transmembrane topologies. Nucleic Acids Res 2003; 31:406-9. [PMID: 12520035 PMCID: PMC165467 DOI: 10.1093/nar/gkg020] [Citation(s) in RCA: 76] [Impact Index Per Article: 3.5] [Received: 08/01/2002] [Revised: 09/09/2002] [Accepted: 09/20/2002] [Indexed: 11/13/2022]
Abstract
TMPDB is a database of experimentally-characterized transmembrane (TM) topologies. TMPDB release 6.2 contains a total of 302 TM protein sequences, of which 276 are alpha-helical sequences, 17 beta-stranded, and 9 alpha-helical sequences with short pore-forming helices buried in the membrane. The TM topologies in TMPDB were determined experimentally by means of X-ray crystallography, NMR, gene fusion technique, substituted cysteine accessibility method, N-linked glycosylation experiment and other biochemical methods. TMPDB would be useful as a test and/or training dataset in improving the proposed TM topology prediction methods or developing novel methods with higher performance, and as a guide for both bioinformaticians and biologists to better understand TM proteins. TMPDB and its subsets are freely available at the following web site: http://bioinfo.si.hirosaki-u.ac.jp/~TMPDB/.
Affiliation(s)
- Masami Ikeda
- Department of Electronic Information System Engineering, Faculty of Science and Technology, Hirosaki University, Hirosaki 036-8561, Japan
112
Ivanciuc O, Schein CH, Braun W. SDAP: database and computational tools for allergenic proteins. Nucleic Acids Res 2003; 31:359-62. [PMID: 12520022 PMCID: PMC165457 DOI: 10.1093/nar/gkg010] [Citation(s) in RCA: 186] [Impact Index Per Article: 8.5] [Indexed: 11/15/2022]
Abstract
SDAP (Structural Database of Allergenic Proteins) is a web server that provides rapid, cross-referenced access to the sequences, structures and IgE epitopes of allergenic proteins. The SDAP core is a series of CGI scripts that process the user queries, interrogate the database, perform various computations related to protein allergenic determinants and prepare the output HTML pages. The database component of SDAP contains information about the allergen name, source, sequence, structure, IgE epitopes and literature references and easy links to the major protein (PDB, SWISS-PROT/TrEMBL, PIR-ALN, NCBI Taxonomy Browser) and literature (PubMed, MEDLINE) on-line servers. The computational component in SDAP uses an original algorithm based on conserved properties of amino acid side chains to identify regions of known allergens similar to user-supplied peptides or selected from the SDAP database of IgE epitopes. This and other bioinformatics tools can be used to rapidly determine potential cross-reactivities between allergens and to screen novel proteins for the presence of IgE epitopes they may share with known allergens. SDAP is available via the World Wide Web at http://fermi.utmb.edu/SDAP/.
Affiliation(s)
- Ovidiu Ivanciuc
- Sealy Center for Structural Biology, Department of Human Biological Chemistry and Genetics, University of Texas Medical Branch, 310 University Boulevard, Galveston, TX 77555-1157, USA
113
la Cour T, Gupta R, Rapacki K, Skriver K, Poulsen FM, Brunak S. NESbase version 1.0: a database of nuclear export signals. Nucleic Acids Res 2003; 31:393-6. [PMID: 12520031 PMCID: PMC165548 DOI: 10.1093/nar/gkg101] [Citation(s) in RCA: 183] [Impact Index Per Article: 8.3] [Indexed: 11/14/2022]
Abstract
Protein export from the nucleus is often mediated by a Leucine-rich Nuclear Export Signal (NES). NESbase is a database of experimentally validated Leucine-rich NESs curated from literature. These signals are not annotated in databases such as SWISS-PROT, PIR or PROSITE. Each NESbase entry contains information on whether the NES was shown to be necessary and/or sufficient for export, and whether the export was shown to be mediated by the export receptor CRM1. The compiled information was used to make a sequence logo of the Leucine-rich NESs, displaying the conservation of amino acids within a window of 25 residues. Surprisingly, only 36% of the sequences used for the logo fit the widely accepted NES consensus L-x(2,3)-[LIVFM]-x(2,3)-L-x-[LI]. The database is available online at http://www.cbs.dtu.dk/databases/NESbase/.
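The consensus quoted in the abstract above is written in PROSITE-style pattern notation, which maps directly onto a regular expression. The sketch below shows one way to test whether a sequence contains a match. It is an illustration only, not the authors' procedure: the helper names are hypothetical, the converter handles just the notation elements that occur in this pattern, and the test peptide is an arbitrary example string.

```python
import re

# Consensus exactly as quoted in the abstract.
NES_CONSENSUS = "L-x(2,3)-[LIVFM]-x(2,3)-L-x-[LI]"

# One pattern element: 'x', a single residue, or a bracketed residue set,
# optionally followed by a repeat count such as (2) or (2,3).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+(?:,\d+)?)\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into regular expression syntax."""
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, repeat = match.groups()
        token = "." if residue == "x" else residue  # 'x' means any residue
        if repeat:
            token += "{" + repeat + "}"
        parts.append(token)
    return "".join(parts)


NES_REGEX = re.compile(prosite_to_regex(NES_CONSENSUS))


def fits_consensus(sequence: str) -> bool:
    """True if the sequence contains at least one match to the NES consensus."""
    return NES_REGEX.search(sequence.upper()) is not None
```

For example, `fits_consensus("LALKLAGLDI")` is true, because the peptide satisfies every position of the pattern, whereas a poly-alanine string does not match. Counting matches over a list of sequences would give the kind of proportion the abstract reports, assuming the same sequence set.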
Affiliation(s)
- Tanja la Cour
- Center for Biological Sequence Analysis, Building-208, Technical University of Denmark, DK-2800 Lyngby, Denmark
114
Mitchell JA, Aronson AR, Mork JG, Folk LC, Humphrey SM, Ward JM. Gene indexing: characterization and analysis of NLM's GeneRIFs. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2003; 2003:460-4. [PMID: 14728215 PMCID: PMC1480312] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/28/2023]
Abstract
We present an initial analysis of the National Library of Medicine's (NLM) Gene Indexing initiative. Gene Indexing occurs at the time of indexing for all 4600 journals and over 500,000 articles added to PubMed/MEDLINE each year. Gene Indexing links articles about the basic biology of a gene or protein within eight model organisms to a specific record in the NLM's LocusLink database of gene products. The result is an entry called a Gene Reference Into Function (GeneRIF) within the LocusLink database. We analyzed the numbers of GeneRIFs produced in the first year of GeneRIF production. 27,645 GeneRIFs were produced, pertaining to 9126 loci over eight model organisms. 60% of these were associated with human genes and 27% with mouse genes. About 80% discuss genes with an established MeSH Heading or other MeSH term. We developed a prototype functional alerting system for researchers based on the GeneRIFs, and a strategy to find all of the literature related to genes. We conclude that the Gene Indexing initiative adds considerable value to the life sciences research community.
115
Pruess M, Apweiler R. Bioinformatics Resources for In Silico Proteome Analysis. J Biomed Biotechnol 2003; 2003:231-236. [PMID: 14615630 PMCID: PMC514268 DOI: 10.1155/s1110724303209219] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 06/26/2002] [Accepted: 12/10/2002] [Indexed: 11/17/2022]
Abstract
In the growing field of proteomics, tools for the in silico analysis of proteins and even of whole proteomes are of crucial importance to make best use of the accumulating amount of data. To utilise this data for healthcare and drug development, first the characteristics of proteomes of entire species, mainly the human, have to be understood, before secondly differentiation between individuals can be surveyed. Specialised databases about nucleic acid sequences, protein sequences, protein tertiary structure, genome analysis, and proteome analysis represent useful resources for analysis, characterisation, and classification of protein sequences. Unlike most proteomics tools, which focus on similarity searches, structure analysis and prediction, detection of specific regions, alignments, data mining, 2D PAGE analysis, or protein modelling, comprehensive databases like the proteome analysis database benefit from the information stored in different databases and make use of different protein analysis tools to provide computational analysis of whole proteomes.
Affiliation(s)
- Manuela Pruess
- EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Rolf Apweiler
- EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
116
Strausberg RL, Feingold EA, Grouse LH, Derge JG, Klausner RD, Collins FS, Wagner L, Shenmen CM, Schuler GD, Altschul SF, Zeeberg B, Buetow KH, Schaefer CF, Bhat NK, Hopkins RF, Jordan H, Moore T, Max SI, Wang J, Hsieh F, Diatchenko L, Marusina K, Farmer AA, Rubin GM, Hong L, Stapleton M, Soares MB, Bonaldo MF, Casavant TL, Scheetz TE, Brownstein MJ, Usdin TB, Toshiyuki S, Carninci P, Prange C, Raha SS, Loquellano NA, Peters GJ, Abramson RD, Mullahy SJ, Bosak SA, McEwan PJ, McKernan KJ, Malek JA, Gunaratne PH, Richards S, Worley KC, Hale S, Garcia AM, Gay LJ, Hulyk SW, Villalon DK, Muzny DM, Sodergren EJ, Lu X, Gibbs RA, Fahey J, Helton E, Ketteman M, Madan A, Rodrigues S, Sanchez A, Whiting M, Madan A, Young AC, Shevchenko Y, Bouffard GG, Blakesley RW, Touchman JW, Green ED, Dickson MC, Rodriguez AC, Grimwood J, Schmutz J, Myers RM, Butterfield YSN, Krzywinski MI, Skalska U, Smailus DE, Schnerch A, Schein JE, Jones SJM, Marra MA. Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. Proc Natl Acad Sci U S A 2002; 99:16899-903. [PMID: 12477932 PMCID: PMC139241 DOI: 10.1073/pnas.242603899] [Citation(s) in RCA: 1365] [Impact Index Per Article: 59.3] [Indexed: 02/06/2023]
Abstract
The National Institutes of Health Mammalian Gene Collection (MGC) Program is a multiinstitutional effort to identify and sequence a cDNA clone containing a complete ORF for each human and mouse gene. ESTs were generated from libraries enriched for full-length cDNAs and analyzed to identify candidate full-ORF clones, which then were sequenced to high accuracy. The MGC has currently sequenced and verified the full ORF for a nonredundant set of >9,000 human and >6,000 mouse genes. Candidate full-ORF clones for an additional 7,800 human and 3,500 mouse genes also have been identified. All MGC sequences and clones are available without restriction through public databases and clone distribution networks (see http://mgc.nci.nih.gov).
117
Michalickova K, Bader GD, Dumontier M, Lieu H, Betel D, Isserlin R, Hogue CWV. SeqHound: biological sequence and structure database as a platform for bioinformatics research. BMC Bioinformatics 2002; 3:32. [PMID: 12401134 PMCID: PMC138791 DOI: 10.1186/1471-2105-3-32] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.2] [Received: 08/01/2002] [Accepted: 10/25/2002] [Indexed: 11/18/2022]
Abstract
BACKGROUND SeqHound has been developed as an integrated biological sequence, taxonomy, annotation and 3-D structure database system. It provides a high-performance server platform for bioinformatics research in a locally-hosted environment. RESULTS SeqHound is based on the National Center for Biotechnology Information data model and programming tools. It offers daily updated contents of all Entrez sequence databases in addition to 3-D structural data and information about sequence redundancies, sequence neighbours, taxonomy, complete genomes, functional annotation including Gene Ontology terms and literature links to PubMed. SeqHound is accessible via a web server through a Perl, C or C++ remote API or an optimized local API. It provides functionality necessary to retrieve specialized subsets of sequences, structures and structural domains. Sequences may be retrieved in FASTA, GenBank, ASN.1 and XML formats. Structures are available in ASN.1, XML and PDB formats. Emphasis has been placed on complete genomes, taxonomy, domain and functional annotation as well as 3-D structural functionality in the API, while fielded text indexing functionality remains under development. SeqHound also offers a streamlined WWW interface for simple web-user queries. CONCLUSIONS The system has proven useful in several published bioinformatics projects such as the BIND database and offers a cost-effective infrastructure for research. SeqHound will continue to develop and be provided as a service of the Blueprint Initiative at the Samuel Lunenfeld Research Institute. The source code and examples are available under the terms of the GNU public license at the Sourceforge site http://sourceforge.net/projects/slritools/ in the SLRI Toolkit.
Affiliation(s)
- Katerina Michalickova
- Department of Biochemistry, University of Toronto, Toronto, Ontario, Canada M5S 1A8
- Samuel Lunenfeld Research Institute, 600 University Avenue, Toronto, Ontario, Canada M5G 1X5
- Gary D Bader
- Department of Biochemistry, University of Toronto, Toronto, Ontario, Canada M5S 1A8
- Samuel Lunenfeld Research Institute, 600 University Avenue, Toronto, Ontario, Canada M5G 1X5
- Michel Dumontier
- Department of Biochemistry, University of Toronto, Toronto, Ontario, Canada M5S 1A8
- Samuel Lunenfeld Research Institute, 600 University Avenue, Toronto, Ontario, Canada M5G 1X5
- Hao Lieu
- Samuel Lunenfeld Research Institute, 600 University Avenue, Toronto, Ontario, Canada M5G 1X5
- Doron Betel
- Department of Biochemistry, University of Toronto, Toronto, Ontario, Canada M5S 1A8
- Samuel Lunenfeld Research Institute, 600 University Avenue, Toronto, Ontario, Canada M5G 1X5
- Ruth Isserlin
- Department of Biochemistry, University of Toronto, Toronto, Ontario, Canada M5S 1A8
- Samuel Lunenfeld Research Institute, 600 University Avenue, Toronto, Ontario, Canada M5G 1X5
- Christopher WV Hogue
- Department of Biochemistry, University of Toronto, Toronto, Ontario, Canada M5S 1A8
- Samuel Lunenfeld Research Institute, 600 University Avenue, Toronto, Ontario, Canada M5G 1X5
118
Zhang RG, Grembecka J, Vinokour E, Collart F, Dementieva I, Minor W, Joachimiak A. Structure of Bacillus subtilis YXKO--a member of the UPF0031 family and a putative kinase. J Struct Biol 2002; 139:161-70. [PMID: 12457846 PMCID: PMC2793413 DOI: 10.1016/s1047-8477(02)00532-4] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.7] [Indexed: 11/22/2022]
Abstract
We determined the 1.6-Å resolution crystal structure of a conserved hypothetical 29.9-kDa protein from the SIGY-CYDD intergenic region encoded by a Bacillus subtilis open reading frame in the YXKO locus. YXKO homologues are broadly distributed and are by and large described as proteins with unknown function. The YXKO protein has an alpha/beta fold and shows high structural homology to the members of a ribokinase-like superfamily. However, YXKO is the only member of this superfamily known to form tetramers. Putative binding sites for adenosine triphosphate (ATP), a substrate, and Mg²⁺-binding sites were revealed in the structure of the protein, based on high structural similarity to ATP-dependent members of the superfamily. Two adjacent monomers contribute residues to the active site. The crystal structure provides valuable information about the YXKO protein's tertiary and quaternary structure, the biochemical function of YXKO and its homologues, and the evolution of its ribokinase-like superfamily.
Affiliation(s)
- R-g. Zhang
- Biosciences Division, Structural Biology Center, Argonne National Laboratory, Argonne, Illinois 60439, USA
- J. Grembecka
- Department of Molecular Biology and Biological Physics, University of Virginia, Charlottesville, Virginia 22908, USA
- E. Vinokour
- Biosciences Division, Structural Biology Center, Argonne National Laboratory, Argonne, Illinois 60439, USA
- F. Collart
- Biosciences Division, Structural Biology Center, Argonne National Laboratory, Argonne, Illinois 60439, USA
- I. Dementieva
- Biosciences Division, Structural Biology Center, Argonne National Laboratory, Argonne, Illinois 60439, USA
- W. Minor
- Department of Molecular Biology and Biological Physics, University of Virginia, Charlottesville, Virginia 22908, USA
- A. Joachimiak
- Biosciences Division, Structural Biology Center, Argonne National Laboratory, Argonne, Illinois 60439, USA
- Corresponding author. Fax: +630-252-5517. (A. Joachimiak)
119
120
Abstract
The explosive growth in biotechnology combined with major advances in information technology has the potential to radically transform immunology in the postgenomics era. Not only do we now have ready access to vast quantities of existing data, but new data with relevance to immunology are being accumulated at an exponential rate. Resources for computational immunology include biological databases and methods for data extraction, comparison, analysis and interpretation. Publicly accessible biological databases of relevance to immunologists number in the hundreds and are growing daily. The ability to efficiently extract and analyse information from these databases is vital for efficient immunology research. Most importantly, a new generation of computational immunology tools enables modelling of peptide transport by the transporter associated with antigen processing (TAP), modelling of antibody binding sites, identification of allergenic motifs and modelling of T-cell receptor serial triggering.
Affiliation(s)
- Nikolai Petrovsky
- National Bioinformatics Centre, University of Canberra and National Health Sciences Centre, Canberra Clinical School, Woden, Australian Capital Territory, Australia.