1
|
Bateman A, Martin MJ, Orchard S, Magrane M, Agivetova R, Ahmad S, Alpi E, Bowler-Barnett EH, Britto R, Bursteinas B, Bye-A-Jee H, Coetzee R, Cukura A, Da Silva A, Denny P, Dogan T, Ebenezer T, Fan J, Castro LG, Garmiri P, Georghiou G, Gonzales L, Hatton-Ellis E, Hussein A, Ignatchenko A, Insana G, Ishtiaq R, Jokinen P, Joshi V, Jyothi D, Lock A, Lopez R, Luciani A, Luo J, Lussi Y, MacDougall A, Madeira F, Mahmoudy M, Menchi M, Mishra A, Moulang K, Nightingale A, Oliveira CS, Pundir S, Qi G, Raj S, Rice D, Lopez MR, Saidi R, Sampson J, Sawford T, Speretta E, Turner E, Tyagi N, Vasudev P, Volynkin V, Warner K, Watkins X, Zaru R, Zellner H, Bridge A, Poux S, Redaschi N, Aimo L, Argoud-Puy G, Auchincloss A, Axelsen K, Bansal P, Baratin D, Blatter MC, Bolleman J, Boutet E, Breuza L, Casals-Casas C, de Castro E, Echioukh KC, Coudert E, Cuche B, Doche M, Dornevil D, Estreicher A, Famiglietti ML, Feuermann M, Gasteiger E, Gehant S, Gerritsen V, Gos A, Gruaz-Gumowski N, Hinz U, Hulo C, Hyka-Nouspikel N, Jungo F, Keller G, Kerhornou A, Lara V, Le Mercier P, Lieberherr D, Lombardot T, Martin X, Masson P, Morgat A, Neto TB, Paesano S, Pedruzzi I, Pilbout S, Pourcel L, Pozzato M, Pruess M, Rivoire C, Sigrist C, Sonesson K, Stutz A, Sundaram S, Tognolli M, Verbregue L, Wu CH, Arighi CN, Arminski L, Chen C, Chen Y, Garavelli JS, Huang H, Laiho K, McGarvey P, Natale DA, Ross K, Vinayaka CR, Wang Q, Wang Y, Yeh LS, Zhang J, Ruch P, Teodoro D. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 2021; 49:D480-D489. [PMID: 33237286 PMCID: PMC7778908 DOI: 10.1093/nar/gkaa1100] [Citation(s) in RCA: 3474] [Impact Index Per Article: 1158.0] [Received: 09/15/2020] [Revised: 10/21/2020] [Accepted: 11/02/2020] [Indexed: 02/07/2023] Open
Abstract
The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this article, we describe significant updates that we have made over the last two years to the resource. The number of sequences in UniProtKB has risen to approximately 190 million, despite continued work to reduce sequence redundancy at the proteome level. We have adopted new methods of assessing proteome completeness and quality. We continue to extract detailed annotations from the literature to add to reviewed entries and supplement these in unreviewed entries with annotations provided by automated systems such as the newly implemented Association-Rule-Based Annotator (ARBA). We have developed a credit-based publication submission interface to allow the community to contribute publications and annotations to UniProt entries. We describe how UniProtKB responded to the COVID-19 pandemic through expert curation of relevant entries that were rapidly made available to the research community through a dedicated portal. UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/.
|
2
|
McGarvey P, Huang J, McCoy M, Orvis J, Katsir Y, Lotringer N, Nesher I, Kavarana M, Sun M, Peet R, Meiri D, Madhavan S. De novo assembly and annotation of transcriptomes from two cultivars of Cannabis sativa with different cannabinoid profiles. Gene 2020; 762:145026. [PMID: 32781193 DOI: 10.1016/j.gene.2020.145026] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/01/2020] [Accepted: 07/31/2020] [Indexed: 10/23/2022]
Abstract
Cannabis has been cultivated for millennia for medicinal, industrial and recreational uses. Our long-term goal is to compare the transcriptomes of cultivars with different cannabinoid profiles for therapeutic purposes. Here we describe the de novo assembly, annotation and initial analysis of two cultivars of Cannabis, a high THC variety and a CBD plus THC variety. Cultivars were grown under different lighting conditions; flower buds were sampled over 71 days. Cannabinoid profiles were determined by ESI-LC/MS. RNA samples were sequenced using the HiSeq4000 platform. Transcriptomes were assembled using the DRAP pipeline and annotated using the BLAST2GO pipeline and other tools. Each transcriptome contained over twenty thousand protein encoding transcripts with ORFs and flanking sequence. Identification of transcripts for cannabinoid pathway and related enzymes showed full-length ORFs that align with the draft genomes of the Purple Kush and Finola cultivars. Two transcripts were found for olivetolic acid cyclase (OAC) that mapped to distinct locations on the Purple Kush genome suggesting multiple genes for OAC are expressed in some cultivars. The ability to make high quality annotated reference transcriptomes in Cannabis or other plants can promote rapid comparative analysis between cultivars and growth conditions in Cannabis and other organisms without annotated genome assemblies.
Affiliation(s)
- Peter McGarvey: Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, DC, USA
- Jiahao Huang: Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, DC, USA
- Matthew McCoy: Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, DC, USA
- Joshua Orvis: Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Yael Katsir: Technion - Israel Institute of Technology, Haifa, Israel
- Mingyang Sun: Teewinot Life Sciences Corporation, Tampa, FL, USA
- David Meiri: Technion - Israel Institute of Technology, Haifa, Israel
- Subha Madhavan: Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, DC, USA
|
3
|
Pawliczek P, Patel RY, Ashmore LR, Jackson AR, Bizon C, Nelson T, Powell B, Freimuth RR, Strande N, Shah N, Paithankar S, Wright MW, Dwight S, Zhen J, Landrum M, McGarvey P, Babb L, Plon SE, Milosavljevic A. ClinGen Allele Registry links information about genetic variants. Hum Mutat 2018; 39:1690-1701. [PMID: 30311374 PMCID: PMC6519371 DOI: 10.1002/humu.23637] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 04/30/2018] [Revised: 08/01/2018] [Accepted: 08/28/2018] [Indexed: 11/18/2022]
Abstract
Effective exchange of information about genetic variants is currently hampered by the lack of readily available globally unique variant identifiers that would enable aggregation of information from different sources. The ClinGen Allele Registry addresses this problem by providing (1) globally unique "canonical" variant identifiers (CAids) on demand, either individually or in large batches; (2) access to variant-identifying information in a searchable Registry; (3) links to allele-related records in many commonly used databases; and (4) services for adding links to information about registered variants in external sources. A core element of the Registry is a canonicalization service, implemented using an in-memory sequence alignment-based index, which groups variant identifiers denoting the same nucleotide variant and assigns unique and dereferenceable CAids. More than 650 million distinct variants are currently registered, including those from gnomAD, ExAC, dbSNP, and ClinVar, as well as a small number of variants registered by Registry users. The Registry is accessible both via a web interface and programmatically via well-documented Hypertext Transfer Protocol (HTTP) Representational State Transfer (REST) Application Programming Interfaces (APIs). For programmatic interoperability, the Registry content is accessible in the JavaScript Object Notation for Linked Data (JSON-LD) format. We present several use cases and demonstrate how the linked information may provide raw material for reasoning about a variant's pathogenicity.
Affiliation(s)
- Piotr Pawliczek: Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
- Ronak Y. Patel: Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
- Lillian R. Ashmore: Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
- Andrew R. Jackson: Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
- Chris Bizon: Renaissance Computing Institute, University of North Carolina, Chapel Hill, North Carolina
- Tristan Nelson: Geisinger's Autism and Developmental Medicine, Lewisburg, Pennsylvania
- Bradford Powell: Department of Genetics, University of North Carolina, Chapel Hill, North Carolina
- Natasha Strande: Department of Genetics, University of North Carolina, Chapel Hill, North Carolina
- Neethu Shah: Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
- Sameer Paithankar: Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
- Matt W. Wright: Department of Biomedical Data Sciences, Stanford University School of Medicine, Palo Alto, California
- Selina Dwight: Department of Biomedical Data Sciences, Stanford University School of Medicine, Palo Alto, California
- Jimmy Zhen: Department of Biomedical Data Sciences, Stanford University School of Medicine, Palo Alto, California
- Melissa Landrum: National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland
- Peter McGarvey: Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, District of Columbia
- Larry Babb: Sunquest Information Systems Company, Boston, Massachusetts
- Sharon E. Plon: Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas; Department of Pediatrics, Baylor College of Medicine, Houston, Texas
|
4
|
Ren J, Li G, Ross K, Arighi C, McGarvey P, Rao S, Cowart J, Madhavan S, Vijay-Shanker K, Wu CH. iTextMine: integrated text-mining system for large-scale knowledge extraction from the literature. Database (Oxford) 2018; 2018:5255177. [PMID: 30576489 PMCID: PMC6301332 DOI: 10.1093/database/bay128] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 03/13/2018] [Accepted: 11/09/2018] [Indexed: 02/07/2023]
Abstract
Numerous efforts have been made for developing text-mining tools to extract information from biomedical text automatically. They have assisted in many biological tasks, such as database curation and hypothesis generation. Text-mining tools are usually different from each other in terms of programming language, system dependency and input/output format. There are few previous works that concern the integration of different text-mining tools and their results from large-scale text processing. In this paper, we describe the iTextMine system with an automated workflow to run multiple text-mining tools on large-scale text for knowledge extraction. We employ parallel processing with dockerized text-mining tools with a standardized JSON output format and implement a text alignment algorithm to solve the text discrepancy for result integration. iTextMine presently integrates four relation extraction tools, which have been used to process all the Medline abstracts and PMC open access full-length articles. The website allows users to browse the text evidence and view integrated results for knowledge discovery through a network view. We demonstrate the utilities of iTextMine with two use cases involving the gene PTEN and breast cancer and the gene SATB1.
Affiliation(s)
- Jia Ren: Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
- Gang Li: Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Karen Ross: Protein Information Resource, Georgetown University Medical Center, Washington, DC, USA
- Cecilia Arighi: Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA; Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Peter McGarvey: Protein Information Resource, Georgetown University Medical Center, Washington, DC, USA; Innovation Center For Biomedical Informatics, Georgetown University, Washington, DC, USA
- Shruti Rao: Innovation Center For Biomedical Informatics, Georgetown University, Washington, DC, USA
- Julie Cowart: Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
- Subha Madhavan: Innovation Center For Biomedical Informatics, Georgetown University, Washington, DC, USA; Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, DC, USA
- K Vijay-Shanker: Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Cathy H Wu: Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA; Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA; Protein Information Resource, Georgetown University Medical Center, Washington, DC, USA
|
5
|
Mahmood ASMA, Rao S, McGarvey P, Wu C, Madhavan S, Vijay-Shanker K. eGARD: Extracting associations between genomic anomalies and drug responses from text. PLoS One 2017; 12:e0189663. [PMID: 29261751 PMCID: PMC5738129 DOI: 10.1371/journal.pone.0189663] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 05/26/2017] [Accepted: 11/29/2017] [Indexed: 12/25/2022] Open
Abstract
Tumor molecular profiling plays an integral role in identifying genomic anomalies which may help in personalizing cancer treatments, improving patient outcomes and minimizing risks associated with different therapies. However, critical information regarding the evidence of clinical utility of such anomalies is largely buried in biomedical literature. It is becoming prohibitive for biocurators, clinical researchers and oncologists to keep up with the rapidly growing volume and breadth of information, especially information that describes the therapeutic implications of biomarkers and is therefore relevant for treatment selection. In an effort to improve and speed up the process of manually reviewing and extracting relevant information from literature, we have developed a natural language processing (NLP)-based text mining (TM) system called eGARD (extracting Genomic Anomalies association with Response to Drugs). This system relies on the syntactic nature of sentences coupled with various textual features to extract relations between genomic anomalies and drug response from MEDLINE abstracts. Our system achieved high precision, recall and F-measure of up to 0.95, 0.86 and 0.90, respectively, on annotated evaluation datasets created in-house and obtained externally from PharmGKB. Additionally, the system extracted information that helps determine the confidence level of extraction to support prioritization of curation. Such a system will enable clinical researchers to explore the use of published markers to stratify patients upfront for 'best-fit' therapies and readily generate hypotheses for new clinical trials.
Affiliation(s)
- A. S. M. Ashique Mahmood: Department of Computer and Information Science, University of Delaware, Newark, Delaware, United States of America
- Shruti Rao: Innovation Center For Biomedical Informatics, Georgetown University, Washington, D.C., United States of America
- Peter McGarvey: Innovation Center For Biomedical Informatics, Georgetown University, Washington, D.C., United States of America; Protein Information Resource, Georgetown University Medical Center, Washington, D.C., United States of America
- Cathy Wu: Department of Computer and Information Science, University of Delaware, Newark, Delaware, United States of America; Protein Information Resource, Georgetown University Medical Center, Washington, D.C., United States of America; Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware, United States of America
- Subha Madhavan: Innovation Center For Biomedical Informatics, Georgetown University, Washington, D.C., United States of America; Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, D.C., United States of America
- K. Vijay-Shanker: Department of Computer and Information Science, University of Delaware, Newark, Delaware, United States of America
|
6
|
Luo L, McGarvey P, Madhavan S, Kumar R, Gusev Y, Upadhyay G. Distinct lymphocyte antigens 6 (Ly6) family members Ly6D, Ly6E, Ly6K and Ly6H drive tumorigenesis and clinical outcome. Oncotarget 2017; 7:11165-93. [PMID: 26862846 PMCID: PMC4905465 DOI: 10.18632/oncotarget.7163] [Citation(s) in RCA: 61] [Impact Index Per Article: 8.7] [Received: 10/07/2015] [Accepted: 01/23/2016] [Indexed: 12/21/2022] Open
Abstract
Stem cell antigen-1 (Sca-1) is used to isolate and characterize tumor-initiating cell populations from tumors of various murine models [1]. Sca-1-induced disruption of TGF-β signaling is required for in vivo tumorigenesis in breast cancer models [2, 3-5]. The role of the human Ly6 gene family is only beginning to be appreciated in recent literature [6-9]. To study the significance of Ly6 gene family members, we visualized one hundred thirty Gene Expression Omnibus (GEO) datasets using Oncomine (Invitrogen) and the Georgetown Database of Cancer (G-DOC). This analysis showed that four different members, Ly6D, Ly6E, Ly6H or Ly6K, have increased gene expression in bladder, brain and CNS, breast, colorectal, cervical, ovarian, lung, head and neck, pancreatic and prostate cancers compared with their normal counterpart tissues. Increased expression of Ly6D, Ly6E, Ly6H or Ly6K was observed in a subset of cancer types. The increased expression of Ly6D, Ly6E, Ly6H and Ly6K was found to be associated with poor outcome in ovarian, colorectal, gastric, breast, lung, bladder or brain and CNS cancers, as observed with the KM plotter and PROGgeneV2 platforms. The remarkable findings of increased expression of Ly6 family members and its positive correlation with poor patient survival in multiple cancer types indicate that the Ly6 family members Ly6D, Ly6E, Ly6K and Ly6H will be important targets in clinical practice, as markers of poor prognosis and for developing novel therapeutics in multiple cancer types.
Affiliation(s)
- Linlin Luo: Innovation Center for Biomedical Informatics (ICBI), Georgetown University Medical Center, Washington, District of Columbia 20007, United States of America; Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States of America
- Peter McGarvey: Innovation Center for Biomedical Informatics (ICBI), Georgetown University Medical Center, Washington, District of Columbia 20007, United States of America; Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States of America
- Subha Madhavan: Innovation Center for Biomedical Informatics (ICBI), Georgetown University Medical Center, Washington, District of Columbia 20007, United States of America; Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States of America
- Rakesh Kumar: Department of Biochemistry and Molecular Medicine, School of Medicine and Health Sciences, George Washington University, Washington, District of Columbia 20037, United States of America
- Yuriy Gusev: Innovation Center for Biomedical Informatics (ICBI), Georgetown University Medical Center, Washington, District of Columbia 20007, United States of America; Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States of America
- Geeta Upadhyay: Innovation Center for Biomedical Informatics (ICBI), Georgetown University Medical Center, Washington, District of Columbia 20007, United States of America; Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States of America
|
7
|
Baraniuk JN, McGarvey P, Suzek BE, Rao S, Lababidi S, Sutherland A, Forshee R, Madhavan S. In silico Analysis of Vaccination Adverse Events. J Allergy Clin Immunol 2015. [DOI: 10.1016/j.jaci.2014.12.1271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
|
8
|
Jain E, Bairoch A, Duvaud S, Phan I, Redaschi N, Suzek BE, Martin MJ, McGarvey P, Gasteiger E. Infrastructure for the life sciences: design and implementation of the UniProt website. BMC Bioinformatics 2009; 10:136. [PMID: 19426475 PMCID: PMC2686714 DOI: 10.1186/1471-2105-10-136] [Citation(s) in RCA: 337] [Impact Index Per Article: 22.5] [Received: 01/12/2009] [Accepted: 05/08/2009] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND: The UniProt consortium was formed in 2002 by groups from the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI) and the Protein Information Resource (PIR) at Georgetown University, and soon afterwards the website http://www.uniprot.org was set up as a central entry point to UniProt resources. Requests to this address were redirected to one of the three organisations' websites. While these sites shared a set of static pages with general information about UniProt, their pages for searching and viewing data were different. To provide users with a consistent view and to cut the cost of maintaining three separate sites, the consortium decided to develop a common website for UniProt. Following several years of intense development and a year of public beta testing, the http://www.uniprot.org domain was switched to the newly developed site described in this paper in July 2008.
DESCRIPTION: The UniProt consortium is the main provider of protein sequence and annotation data for much of the life sciences community. The http://www.uniprot.org website is the primary access point to this data and to documentation and basic tools for the data. These tools include full text and field-based text search, similarity search, multiple sequence alignment, batch retrieval and database identifier mapping. This paper discusses the design and implementation of the new website, which was released in July 2008, and shows how it improves data access for users with different levels of experience, as well as to machines for programmatic access. http://www.uniprot.org/ is open for both academic and commercial use. The site was built with open source tools and libraries. Feedback is very welcome and should be sent to help@uniprot.org.
CONCLUSION: The new UniProt website makes accessing and understanding UniProt easier than ever. The two main lessons learned are that getting the basics right for such a data provider website has huge benefits, but is not trivial and easy to underestimate, and that there is no substitute for using empirical data throughout the development process to decide on what is and what is not working for your users.
Affiliation(s)
- Eric Jain: Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 Michel Servet, 1211 Geneva 4, Switzerland
- Amos Bairoch: Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 Michel Servet, 1211 Geneva 4, Switzerland; Department of Structural Biology and Bioinformatics, Faculty of Medicine, University of Geneva, 1 Michel Servet, 1211 Geneva 4, Switzerland
- Severine Duvaud: Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 Michel Servet, 1211 Geneva 4, Switzerland
- Isabelle Phan: Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 Michel Servet, 1211 Geneva 4, Switzerland
- Nicole Redaschi: Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 Michel Servet, 1211 Geneva 4, Switzerland
- Baris E Suzek: Protein Information Resource (PIR), Georgetown University Medical Center, 3300 Whitehaven Street NW, Washington, DC 20007, USA
- Maria J Martin: The EMBL Outstation – European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Peter McGarvey: Protein Information Resource (PIR), Georgetown University Medical Center, 3300 Whitehaven Street NW, Washington, DC 20007, USA
- Elisabeth Gasteiger: Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 Michel Servet, 1211 Geneva 4, Switzerland
|
9
|
Zhang C, Crasta O, Cammer S, Will R, Kenyon R, Sullivan D, Yu Q, Sun W, Jha R, Liu D, Xue T, Zhang Y, Moore M, McGarvey P, Huang H, Chen Y, Zhang J, Mazumder R, Wu C, Sobral B. An emerging cyberinfrastructure for biodefense pathogen and pathogen-host data. Nucleic Acids Res 2008; 36:D884-91. [PMID: 17984082 PMCID: PMC2239001 DOI: 10.1093/nar/gkm903] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 08/15/2007] [Revised: 10/04/2007] [Accepted: 10/05/2007] [Indexed: 01/07/2023] Open
Abstract
The NIAID-funded Biodefense Proteomics Resource Center (RC) provides storage, dissemination, visualization and analysis capabilities for the experimental data deposited by seven Proteomics Research Centers (PRCs). The data and their publication are intended to support researchers working to discover candidates for the next generation of vaccines, therapeutics and diagnostics against NIAID's Category A, B and C priority pathogens. The data include transcriptional profiles, protein profiles, protein structural data and host-pathogen protein interactions, in the context of the pathogen life cycle in vivo and in vitro. The database has stored and supported host or pathogen data derived from Bacillus, Brucella, Cryptosporidium, Salmonella, SARS, Toxoplasma, Vibrio and Yersinia, human tissue libraries, and mouse macrophages. These publicly available data cover diverse data types such as mass spectrometry, yeast two-hybrid (Y2H), gene expression profiles, X-ray and NMR determined protein structures and protein expression clones. The growing database covers over 23 000 unique genes/proteins from different experiments and organisms. All of the genes/proteins are annotated and integrated across experiments using UniProt Knowledgebase (UniProtKB) accession numbers. The web interface for the database enables searching, querying and downloading at the level of experiment, group and individual gene(s)/protein(s) via UniProtKB accession numbers or protein function keywords. The system is accessible at http://www.proteomicsresource.org/.
Affiliation(s)
- C. Zhang
- Virginia Bioinformatics Institute at Virginia Polytechnic Institute and State University, Washington Street (0477), Blacksburg, VA 24061, Social & Scientific Systems, Inc., 8757 Georgia Avenue, 12th Floor Silver Spring, MD 20910 and Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, 3300 Whitehaven Street NW, Suite 1200, Washington, DC 20007, USA
| | - O. Crasta
- Virginia Bioinformatics Institute at Virginia Polytechnic Institute and State University, Washington Street (0477), Blacksburg, VA 24061, Social & Scientific Systems, Inc., 8757 Georgia Avenue, 12th Floor Silver Spring, MD 20910 and Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, 3300 Whitehaven Street NW, Suite 1200, Washington, DC 20007, USA
| | - S. Cammer
- Virginia Bioinformatics Institute at Virginia Polytechnic Institute and State University, Washington Street (0477), Blacksburg, VA 24061, Social & Scientific Systems, Inc., 8757 Georgia Avenue, 12th Floor Silver Spring, MD 20910 and Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, 3300 Whitehaven Street NW, Suite 1200, Washington, DC 20007, USA
| | - R. Will
- Virginia Bioinformatics Institute at Virginia Polytechnic Institute and State University, Washington Street (0477), Blacksburg, VA 24061, Social & Scientific Systems, Inc., 8757 Georgia Avenue, 12th Floor Silver Spring, MD 20910 and Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, 3300 Whitehaven Street NW, Suite 1200, Washington, DC 20007, USA
| | - R. Kenyon
- Virginia Bioinformatics Institute at Virginia Polytechnic Institute and State University, Washington Street (0477), Blacksburg, VA 24061, Social & Scientific Systems, Inc., 8757 Georgia Avenue, 12th Floor Silver Spring, MD 20910 and Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, 3300 Whitehaven Street NW, Suite 1200, Washington, DC 20007, USA
- D. Sullivan
- Q. Yu
- W. Sun
- R. Jha
- D. Liu
- T. Xue
- Y. Zhang
- M. Moore
- P. McGarvey
- H. Huang
- Y. Chen
- J. Zhang
- R. Mazumder
- C. Wu
- B. Sobral
- Virginia Bioinformatics Institute at Virginia Polytechnic Institute and State University, Washington Street (0477), Blacksburg, VA 24061; Social & Scientific Systems, Inc., 8757 Georgia Avenue, 12th Floor, Silver Spring, MD 20910; and Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, 3300 Whitehaven Street NW, Suite 1200, Washington, DC 20007, USA
10
Abstract
MOTIVATION: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences. RESULTS: The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis. AVAILABILITY: UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Baris E Suzek
- Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC 20007, USA.
11
McGarvey P, Tousignant M, Geletka L, Cellini F, Kaper JM. The complete sequence of a cucumber mosaic virus from Ixora that is deficient in the replication of satellite RNAs. J Gen Virol 1995; 76(Pt 9):2257-70. [PMID: 7561763 DOI: 10.1099/0022-1317-76-9-2257] [Citation(s) in RCA: 20] [Impact Index Per Article: 0.7] [Indexed: 01/26/2023]
Abstract
A cucumber mosaic virus (CMV-Ix) from Ixora is unusual in that it does not support the accumulation of some well-characterized CMV satellite RNAs in plants. CMV-Ix can support a particular satellite RNA variant which causes lethal tomato necrosis when inoculated with other CMV strains but not when inoculated with CMV-Ix. This difference in ability to support accumulation of specific satellite variants is apparent even when their sequences differ by only 10 nucleotides. Electroporation of tomato protoplasts with combinations of CMV-Ix or CMV-1 RNA plus the same satellite variants showed similar differences in accumulation, indicating a defect in satellite RNA replication and not movement or encapsidation. Pseudorecombinant virus infections between CMV-1 and CMV-Ix indicated that the genomic determinants responsible for this phenotype reside on RNA 1 since only combinations with CMV-Ix RNA 1 failed to replicate satellite RNA. The complete genome of CMV-Ix was cloned, sequenced and compared with the genomes of other cucumoviruses. CMV-Ix is most similar in RNA and protein sequence to subgroup 1 CMV-Fny and CMV-Y but slightly less similar than they are to each other. CMV-Ix and all cucumovirus strains sequenced thus far share a domain in the 3' untranslated portion of their genomic RNAs in which 39 of 40 bases are completely conserved.
Affiliation(s)
- P McGarvey
- Molecular Plant Pathology Laboratory, PSI, USDA, Beltsville, MD 20705, USA
12
McGarvey P, Kaper JM. A simple and rapid method for screening transgenic plants using the PCR. Biotechniques 1991; 11:428-32. [PMID: 1793572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/28/2022]
Affiliation(s)
- P McGarvey
- Microbiology and Plant Pathology Laboratory, USDA, Beltsville, MD 20705
13
Abstract
The ribosomal RNA operons (rrn operons) of Euglena gracilis chloroplasts contain genes for (in order) 16S rRNA, tRNA(Ile), tRNA(Ala), 23S rRNA and 5S rRNA. Major sites of cleavage of the primary rrn transcript were identified by Northern blot hybridization and S1-mapping. The presumptive termini of all of the mature products have now been identified. During initial processing in the chloroplast, the primary transcript is cleaved between the two tRNAs and between the 23S and 5S rRNAs so as to separate the sequences found in the different mature rRNAs. Subsequently the tRNAs are separated from the rRNAs, further trimming provides the remaining proper ends, and the 3'-ends of the tRNAs are added.
MESH Headings
- Animals
- Base Sequence
- Blotting, Northern
- Chloroplasts/metabolism
- DNA, Ribosomal/genetics
- Euglena gracilis/genetics
- Molecular Sequence Data
- Operon
- RNA Processing, Post-Transcriptional
- RNA, Ribosomal/genetics
- RNA, Ribosomal, 16S/genetics
- RNA, Ribosomal, 23S/genetics
- RNA, Ribosomal, 5S/genetics
- Restriction Mapping
- Sequence Homology, Nucleic Acid
- Transcription, Genetic
Affiliation(s)
- P McGarvey
- Department of Biology, University of Michigan, Ann Arbor 48109
14
Abstract
The site of initiation of chloroplast rRNA synthesis was determined by S1-mapping and by sequencing primary rRNA transcripts specifically labeled at their 5'-end. Transcription initiates at a single site 53 nucleotides upstream of the 5'-end of the mature 16S rRNA under all growth conditions examined. The initiation site is within a DNA sequence that is highly homologous to and probably derived from a tRNA gene-region located elsewhere in the chloroplast genome. A nearly identical sequence (102 of 103 nucleotides) is present near the replication origin. The near identity of the two sequences suggests a common mode for control of transcription of the rRNA genes and initiation of chloroplast DNA replication. The related sequence in the tRNA gene-region does not appear to serve as a transcript initiation site.
Affiliation(s)
- P McGarvey
- Department of Biology, University of Michigan, Ann Arbor 48109