1
Schmidt B, Hildebrandt A. From GPUs to AI and quantum: three waves of acceleration in bioinformatics. Drug Discov Today 2024; 29:103990. [PMID: 38663581 DOI: 10.1016/j.drudis.2024.103990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 04/05/2024] [Accepted: 04/17/2024] [Indexed: 05/01/2024]
Abstract
The enormous growth in the amount of data generated by the life sciences is continuously shifting the field from model-driven science towards data-driven science. The need for efficient processing has led to the adoption of massively parallel accelerators such as graphics processing units (GPUs). Consequently, the development of bioinformatics methods nowadays often heavily depends on the effective use of these powerful technologies. Furthermore, progress in computational techniques and architectures continues to be highly dynamic, involving novel deep neural network models and artificial intelligence (AI) accelerators, and potentially quantum processing units in the future. These are expected to be disruptive for the life sciences as a whole and for drug discovery in particular. Here, we identify three waves of acceleration and their applications in a bioinformatics context: (i) GPU computing, (ii) AI and (iii) next-generation quantum computers.
Affiliation(s)
- Bertil Schmidt
  - Institut für Informatik, Johannes Gutenberg University, Mainz, Germany
2
Title PO, Singhal S, Grundler MC, Costa GC, Pyron RA, Colston TJ, Grundler MR, Prates I, Stepanova N, Jones MEH, Cavalcanti LBQ, Colli GR, Di-Poï N, Donnellan SC, Moritz C, Mesquita DO, Pianka ER, Smith SA, Vitt LJ, Rabosky DL. The macroevolutionary singularity of snakes. Science 2024; 383:918-923. [PMID: 38386744 DOI: 10.1126/science.adh2449] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Received: 02/22/2023] [Accepted: 01/02/2024] [Indexed: 02/24/2024]
Abstract
Snakes and lizards (Squamata) represent a third of terrestrial vertebrates and exhibit spectacular innovations in locomotion, feeding, and sensory processing. However, the evolutionary drivers of this radiation remain poorly known. We infer potential causes and ultimate consequences of squamate macroevolution by combining individual-based natural history observations (>60,000 animals) with a comprehensive time-calibrated phylogeny that we anchored with genomic data (5400 loci) from 1018 species. Due to shifts in the dynamics of speciation and phenotypic evolution, snakes have transformed the trophic structure of animal communities through the recurrent origin and diversification of specialized predatory strategies. Squamate biodiversity reflects a legacy of singular events that occurred during the early history of snakes and reveals the impact of historical contingency on vertebrate biodiversity.
Affiliation(s)
- Pascal O Title
  - Department of Ecology and Evolution, Stony Brook University, Stony Brook, NY 11794, USA
  - Environmental Resilience Institute, Indiana University, Bloomington, IN 47408, USA
  - Museum of Zoology and Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
- Sonal Singhal
  - Museum of Zoology and Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Biology, California State University, Dominguez Hills, Carson, CA 90747, USA
- Michael C Grundler
  - Museum of Zoology and Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
- Gabriel C Costa
  - Museum of Zoology and Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Biology and Environmental Sciences, Auburn University at Montgomery, Montgomery, AL 36117, USA
- R Alexander Pyron
  - Department of Biological Sciences, The George Washington University, Washington, DC 20052, USA
  - Department of Vertebrate Zoology, National Museum of Natural History, Smithsonian Institution, Washington, DC, 20560, USA
- Timothy J Colston
  - Department of Vertebrate Zoology, National Museum of Natural History, Smithsonian Institution, Washington, DC, 20560, USA
  - Biology Department, University of Puerto Rico at Mayagüez, Mayagüez 00680, Puerto Rico
- Maggie R Grundler
  - Museum of Zoology and Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Environmental Science, Policy, and Management, University of California, Berkeley, Berkeley, CA 94720, USA
  - Museum of Vertebrate Zoology, University of California, Berkeley, Berkeley, CA 94720, USA
- Ivan Prates
  - Museum of Zoology and Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
- Natasha Stepanova
  - Museum of Zoology and Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
- Marc E H Jones
  - Science Group: Fossil Reptiles, Amphibians and Birds Section, Natural History Museum, London SW7 5BD, UK
  - Research Department of Cell and Developmental Biology, University College London, London WC1E 6BT, UK
  - Biological Sciences, University of Adelaide, Adelaide, SA 5005, Australia
- Lucas B Q Cavalcanti
  - Departamento de Sistemática e Ecologia, Universidade Federal da Paraíba, João Pessoa, Paraíba 58051-900, Brazil
- Guarino R Colli
  - Departamento de Zoologia, Universidade de Brasília, Brasília, Distrito Federal 70910-900, Brazil
- Nicolas Di-Poï
  - Institute of Biotechnology, Helsinki Institute of Life Science, University of Helsinki, 00014 Helsinki, Finland
- Craig Moritz
  - Research School of Biology, The Australian National University, Canberra, ACT 2600, Australia
- Daniel O Mesquita
  - Departamento de Sistemática e Ecologia, Universidade Federal da Paraíba, João Pessoa, Paraíba 58051-900, Brazil
- Eric R Pianka
  - Department of Integrative Biology, The University of Texas at Austin, Austin, TX 78712, USA
- Stephen A Smith
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
- Laurie J Vitt
  - Sam Noble Museum and Department of Biology, University of Oklahoma, Norman, OK, USA
- Daniel L Rabosky
  - Museum of Zoology and Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
3
Kwon Y, Rösner H, Zhao W, Selemenakis P, He Z, Kawale AS, Katz JN, Rogers CM, Neal FE, Badamchi Shabestari A, Petrosius V, Singh AK, Joel MZ, Lu L, Holloway SP, Burma S, Mukherjee B, Hromas R, Mazin A, Wiese C, Sørensen CS, Sung P. DNA binding and RAD51 engagement by the BRCA2 C-terminus orchestrate DNA repair and replication fork preservation. Nat Commun 2023; 14:432. [PMID: 36702902 PMCID: PMC9879961 DOI: 10.1038/s41467-023-36211-x] [Citation(s) in RCA: 29] [Impact Index Per Article: 14.5] [Received: 09/06/2022] [Accepted: 01/19/2023] [Indexed: 01/27/2023] [Open Access]
Abstract
The tumor suppressor BRCA2 participates in DNA double-strand break repair by RAD51-dependent homologous recombination and protects stressed DNA replication forks from nucleolytic attack. We demonstrate that the C-terminal Recombinase Binding (CTRB) region of BRCA2, encoded by gene exon 27, harbors a DNA binding activity. CTRB alone stimulates the DNA strand exchange activity of RAD51 and permits the utilization of RPA-coated ssDNA by RAD51 for strand exchange. Moreover, CTRB functionally synergizes with the Oligonucleotide Binding fold containing DNA binding domain and BRC4 repeat of BRCA2 in RPA-RAD51 exchange on ssDNA. Importantly, we show that the DNA binding and RAD51 interaction attributes of the CTRB are crucial for homologous recombination and protection of replication forks against MRE11-mediated attrition. Our findings shed light on the role of the CTRB region in genome repair, reveal remarkable functional plasticity of BRCA2, and help explain why deletion of Brca2 exon 27 impacts upon embryonic lethality.
Affiliation(s)
- Youngho Kwon
  - Department of Biochemistry and Structural Biology and Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
- Heike Rösner
  - Biotech Research and Innovation Centre, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
- Weixing Zhao
  - Department of Biochemistry and Structural Biology and Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
- Platon Selemenakis
  - Department of Environmental and Radiological Health Sciences, Colorado State University, Fort Collins, CO, USA
  - Department of Cancer Biology, University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Zhuoling He
  - Department of Biochemistry and Structural Biology and Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
- Ajinkya S Kawale
  - Department of Biochemistry and Structural Biology and Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
  - Massachusetts General Hospital Cancer Center, Harvard Medical School, Charlestown, MA, 02129, USA
- Jeffrey N Katz
  - Department of Biochemistry and Structural Biology and Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
- Cody M Rogers
  - Department of Biochemistry and Structural Biology and Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
- Francisco E Neal
  - Department of Biochemistry and Structural Biology and Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
- Aida Badamchi Shabestari
  - Department of Biochemistry and Structural Biology and Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
- Valdemaras Petrosius
  - Biotech Research and Innovation Centre, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
- Akhilesh K Singh
  - Department of Molecular Biophysics and Biochemistry, Yale University School of Medicine, New Haven, CT, USA
  - GentiBio Inc., 150 Cambridgepark Dr, Cambridge, MA, 02140, USA
- Marina Z Joel
  - Department of Molecular Biophysics and Biochemistry, Yale University School of Medicine, New Haven, CT, USA
  - Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Lucy Lu
  - Department of Molecular Biophysics and Biochemistry, Yale University School of Medicine, New Haven, CT, USA
- Stephen P Holloway
  - Department of Biochemistry and Structural Biology and Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
- Sandeep Burma
  - Department of Biochemistry and Structural Biology and Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
  - Department of Neurosurgery, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
- Bipasha Mukherjee
  - Department of Neurosurgery, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
- Robert Hromas
  - Department of Medicine, University of Texas Health at San Antonio, 7703 Floyd Curl Drive, San Antonio, TX, 78229, USA
- Alexander Mazin
  - Department of Biochemistry and Structural Biology and Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
- Claudia Wiese
  - Department of Environmental and Radiological Health Sciences, Colorado State University, Fort Collins, CO, USA
- Claus S Sørensen
  - Biotech Research and Innovation Centre, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
- Patrick Sung
  - Department of Biochemistry and Structural Biology and Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
4
Kuang M, Zhang Y, Lam TW, Ting HF. MLProbs: A Data-Centric Pipeline for Better Multiple Sequence Alignment. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:524-533. [PMID: 35120007 DOI: 10.1109/tcbb.2022.3148382] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
In this paper, we explore using the data-centric approach to tackle the Multiple Sequence Alignment (MSA) construction problem. Unlike the algorithm-centric approach, which reduces the construction problem to a combinatorial optimization problem based on an abstract mathematical model, the data-centric approach explores using classification models trained from existing benchmark data to guide the construction. We identified two simple classifications to help us choose a better alignment tool and determine whether and how much to carry out realignment. We show that shallow machine-learning algorithms suffice to train sensitive models for these classifications. Based on these models, we implemented a new multiple sequence alignment pipeline, called MLProbs. Compared with 10 other popular alignment tools over four benchmark databases (namely, BAliBASE, OXBench, OXBench-X and SABMark), MLProbs consistently gives the highest TC score. More importantly, MLProbs shows non-trivial improvement for protein families with low similarity; in particular, when evaluated against the 1,356 protein families with similarity ≤ 50%, MLProbs achieves a TC score of 56.93, while the next best three tools are in the range of [55.41, 55.91] (increased by more than 1.8%). We also compared the performance of MLProbs and other MSA tools in two real-life applications - Phylogenetic Tree Construction Analysis and Protein Secondary Structure Prediction - and MLProbs also had the best performance. In our study, we used only shallow machine-learning algorithms to train our models. It would be interesting to study whether deep-learning methods can help make further improvements, so we suggest some possible research directions in the conclusion section.
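As a rough illustration of the metric quoted in this abstract (not the MLProbs code or the benchmark scorers), the sketch below assumes the common definition of the total-column (TC) score: the percentage of reference-alignment columns reproduced exactly in the test alignment. The function names are hypothetical, and benchmark-specific details such as restricting the score to core blocks are deliberately left out. The last helper reproduces the relative-gain arithmetic behind the reported "more than 1.8%".

```python
def column_signatures(alignment, gap="-"):
    """Describe each column by the residue index of every sequence in it (None for a gap)."""
    counters = [0] * len(alignment)  # next residue index for each sequence
    signatures = []
    for col in range(len(alignment[0])):
        signature = []
        for row, seq in enumerate(alignment):
            if seq[col] == gap:
                signature.append(None)
            else:
                signature.append(counters[row])
                counters[row] += 1
        signatures.append(tuple(signature))
    return signatures


def tc_score(reference, test):
    """Percentage of reference columns that appear unchanged in the test alignment."""
    reference_columns = column_signatures(reference)
    test_columns = set(column_signatures(test))
    matched = sum(1 for column in reference_columns if column in test_columns)
    return 100.0 * matched / len(reference_columns)


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    reference = ["AC-GT", "ACTGT"]  # toy alignments, not benchmark data
    test = ["ACG-T", "ACTGT"]
    print(tc_score(reference, test))
    print(round(relative_gain(56.93, 55.91), 2))
```

With the figures from the abstract, `relative_gain(56.93, 55.91)` compares MLProbs against the upper end of the range reported for the next best three tools.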
5
Kreitmeier M, Ardern Z, Abele M, Ludwig C, Scherer S, Neuhaus K. Spotlight on alternative frame coding: Two long overlapping genes in Pseudomonas aeruginosa are translated and under purifying selection. iScience 2022; 25:103844. [PMID: 35198897 PMCID: PMC8850804 DOI: 10.1016/j.isci.2022.103844] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/10/2021] [Revised: 10/14/2021] [Accepted: 01/27/2022] [Indexed: 12/13/2022] [Open Access]
Abstract
The existence of overlapping genes (OLGs) with significant coding overlaps revolutionizes our understanding of genomic complexity. We report two exceptionally long (957 nt and 1536 nt), evolutionarily novel, translated antisense open reading frames (ORFs) embedded within annotated genes in the pathogenic Gram-negative bacterium Pseudomonas aeruginosa. Both OLG pairs show sequence features consistent with being genes and transcriptional signals in RNA sequencing. Translation of both OLGs was confirmed by ribosome profiling and mass spectrometry. Quantitative proteomics of samples taken during different phases of growth revealed regulation of protein abundances, implying biological functionality. Both OLGs are taxonomically restricted, and likely arose by overprinting within the genus. Evidence for purifying selection further supports functionality. The OLGs reported here, designated olg1 and olg2, are the longest yet proposed in prokaryotes and are among the best attested in terms of translation and evolutionary constraint. These results highlight a potentially large unexplored dimension of prokaryotic genomes.
Affiliation(s)
- Michaela Kreitmeier
  - Chair for Microbial Ecology, TUM School of Life Sciences, Technische Universität München, Weihenstephaner Berg 3, 85354 Freising, Germany
- Zachary Ardern
  - Chair for Microbial Ecology, TUM School of Life Sciences, Technische Universität München, Weihenstephaner Berg 3, 85354 Freising, Germany
  - Wellcome Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Miriam Abele
  - Bavarian Center for Biomolecular Mass Spectrometry (BayBioMS), TUM School of Life Sciences, Technische Universität München, Gregor-Mendel-Strasse 4, 85354 Freising, Germany
- Christina Ludwig
  - Bavarian Center for Biomolecular Mass Spectrometry (BayBioMS), TUM School of Life Sciences, Technische Universität München, Gregor-Mendel-Strasse 4, 85354 Freising, Germany
- Siegfried Scherer
  - Chair for Microbial Ecology, TUM School of Life Sciences, Technische Universität München, Weihenstephaner Berg 3, 85354 Freising, Germany
- Klaus Neuhaus
  - Core Facility Microbiome, ZIEL – Institute for Food & Health, Technische Universität München, Weihenstephaner Berg 3, 85354 Freising, Germany
6
López-Pérez M, Jayakumar JM, Grant TA, Zaragoza-Solas A, Cabello-Yeves PJ, Almagro-Moreno S. Ecological diversification reveals routes of pathogen emergence in endemic Vibrio vulnificus populations. Proc Natl Acad Sci U S A 2021; 118:e2103470118. [PMID: 34593634 PMCID: PMC8501797 DOI: 10.1073/pnas.2103470118] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Accepted: 08/09/2021] [Indexed: 12/17/2022] [Open Access]
Abstract
Pathogen emergence is a complex phenomenon that, despite its public health relevance, remains poorly understood. Vibrio vulnificus, an emergent human pathogen, can cause a deadly septicaemia with over 50% mortality rate. To date, the ecological drivers that lead to the emergence of clinical strains and the unique genetic traits that allow these clones to colonize the human host remain mostly unknown. We recently surveyed a large estuary in eastern Florida, where outbreaks of the disease frequently occur, and found endemic populations of the bacterium. We established two sampling sites and observed strong correlations between location and pathogenic potential. One site is significantly enriched with strains that belong to one phylogenomic cluster (C1) in which the majority of clinical strains belong. Interestingly, strains isolated from this site exhibit phenotypic traits associated with clinical outcomes, whereas strains from the second site belong to a cluster that rarely causes disease in humans (C2). Analyses of C1 genomes indicate unique genetic markers in the form of clinical-associated alleles with a potential role in virulence. Finally, metagenomic and physicochemical analyses of the sampling sites indicate that this marked cluster distribution and genetic traits are strongly associated with distinct biotic and abiotic factors (e.g., salinity, nutrients, or biodiversity), revealing how ecosystems generate selective pressures that facilitate the emergence of specific strains with pathogenic potential in a population. This knowledge can be applied to assess the risk of pathogen emergence from environmental sources and integrated toward the development of novel strategies for the prevention of future outbreaks.
Affiliation(s)
- Mario López-Pérez
  - Burnett School of Biomedical Sciences, College of Medicine, University of Central Florida, Orlando, FL 32816
  - National Center for Integrated Coastal Research, University of Central Florida, Orlando, FL 32816
  - Evolutionary Genomics Group, División de Microbiología, Universidad Miguel Hernández, 03550 Alicante, Spain
- Jane M Jayakumar
  - Burnett School of Biomedical Sciences, College of Medicine, University of Central Florida, Orlando, FL 32816
  - National Center for Integrated Coastal Research, University of Central Florida, Orlando, FL 32816
- Trudy-Ann Grant
  - Burnett School of Biomedical Sciences, College of Medicine, University of Central Florida, Orlando, FL 32816
  - National Center for Integrated Coastal Research, University of Central Florida, Orlando, FL 32816
- Asier Zaragoza-Solas
  - Evolutionary Genomics Group, División de Microbiología, Universidad Miguel Hernández, 03550 Alicante, Spain
- Pedro J Cabello-Yeves
  - Evolutionary Genomics Group, División de Microbiología, Universidad Miguel Hernández, 03550 Alicante, Spain
- Salvador Almagro-Moreno
  - Burnett School of Biomedical Sciences, College of Medicine, University of Central Florida, Orlando, FL 32816
  - National Center for Integrated Coastal Research, University of Central Florida, Orlando, FL 32816
7
Coutinho FH, Zaragoza-Solas A, López-Pérez M, Barylski J, Zielezinski A, Dutilh BE, Edwards R, Rodriguez-Valera F. RaFAH: Host prediction for viruses of Bacteria and Archaea based on protein content. Patterns 2021; 2:100274. [PMID: 34286299 PMCID: PMC8276007 DOI: 10.1016/j.patter.2021.100274] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Received: 10/28/2020] [Revised: 11/23/2020] [Accepted: 05/07/2021] [Indexed: 02/06/2023]
Abstract
Culture-independent approaches have recently shed light on the genomic diversity of viruses of prokaryotes. One fundamental question when trying to understand their ecological roles is: which host do they infect? To tackle this issue we developed a machine-learning approach named Random Forest Assignment of Hosts (RaFAH), that uses scores to 43,644 protein clusters to assign hosts to complete or fragmented genomes of viruses of Archaea and Bacteria. RaFAH displayed performance comparable with that of other methods for virus-host prediction in three different benchmarks encompassing viruses from RefSeq, single amplified genomes, and metagenomes. RaFAH was applied to assembled metagenomic datasets of uncultured viruses from eight different biomes of medical, biotechnological, and environmental relevance. Our analyses led to the identification of 537 sequences of archaeal viruses representing unknown lineages, whose genomes encode novel auxiliary metabolic genes, shedding light on how these viruses interfere with the host molecular machinery. RaFAH is available at https://sourceforge.net/projects/rafah/.
- RaFAH was developed to predict the hosts of viruses of Bacteria and Archaea
- RaFAH displayed comparable or superior performance to other host-prediction tools
- RaFAH performed well across viromes from eight different ecosystems
- RaFAH identified hundreds of genomic sequences as derived from viruses of Archaea
Viruses that infect Bacteria and Archaea are ubiquitous and extremely abundant. Recent advances have led to the discovery of many thousands of complete and partial genomes of these biological entities. Understanding the biology of these viruses and how they influence their ecosystems depends on knowing which hosts they infect. We developed a tool that uses data from complete or fragmented genomes to predict the hosts of viruses using a machine-learning approach. Our tool, RaFAH, displayed performance comparable with or superior to that of other host-prediction tools. In addition, it identified hundreds of sequences as derived from the genomes of viruses of Archaea, which are one of the least characterized fractions of the global virosphere.
Affiliation(s)
- Felipe Hernandes Coutinho
  - Evolutionary Genomics Group, Departamento de Producción Vegetal y Microbiología, Universidad Miguel Hernández, Aptdo. 18., Ctra. Alicante-Valencia N-332, s/n, San Juan de Alicante, 03550 Alicante, Spain
- Asier Zaragoza-Solas
  - Evolutionary Genomics Group, Departamento de Producción Vegetal y Microbiología, Universidad Miguel Hernández, Aptdo. 18., Ctra. Alicante-Valencia N-332, s/n, San Juan de Alicante, 03550 Alicante, Spain
- Mario López-Pérez
  - Evolutionary Genomics Group, Departamento de Producción Vegetal y Microbiología, Universidad Miguel Hernández, Aptdo. 18., Ctra. Alicante-Valencia N-332, s/n, San Juan de Alicante, 03550 Alicante, Spain
- Jakub Barylski
  - Molecular Virology Research Unit, Faculty of Biology, Adam Mickiewicz University Poznan, 61-614 Poznan, Poland
- Andrzej Zielezinski
  - Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, 61-614 Poznan, Poland
- Bas E Dutilh
  - Centre for Molecular and Biomolecular Informatics (CMBI), Radboud University Medical Centre/Radboud Institute for Molecular Life Sciences, 6525 GA Nijmegen, the Netherlands
  - Theoretical Biology and Bioinformatics, Science for Life, Utrecht University (UU), 3584 CH Utrecht, the Netherlands
- Robert Edwards
  - College of Science and Engineering, Flinders University, Bedford Park, SA 5042, Australia
- Francisco Rodriguez-Valera
  - Evolutionary Genomics Group, Departamento de Producción Vegetal y Microbiología, Universidad Miguel Hernández, Aptdo. 18., Ctra. Alicante-Valencia N-332, s/n, San Juan de Alicante, 03550 Alicante, Spain
  - Moscow Institute of Physics and Technology, Dolgoprudny 141701, Russia
8
Nakamura T, Yamada KD, Tomii K, Katoh K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics 2018; 34:2490-2492. [PMID: 29506019 PMCID: PMC6041967 DOI: 10.1093/bioinformatics/bty121] [Citation(s) in RCA: 592] [Impact Index Per Article: 98.7] [Received: 10/19/2017] [Accepted: 02/28/2018] [Indexed: 12/03/2022] [Open Access]
Abstract
Summary: We report an update for the MAFFT multiple sequence alignment program to enable parallel calculation of large numbers of sequences. The G-INS-1 option of MAFFT was recently reported to have higher accuracy than other methods for large data, but this method has been impractical for most large-scale analyses, due to the requirement of large computational resources. We introduce a scalable variant, G-large-INS-1, which has equivalent accuracy to G-INS-1 and is applicable to 50 000 or more sequences.
Availability and implementation: This feature is available in MAFFT versions 7.355 or later at https://mafft.cbrc.jp/alignment/software/mpi.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tsukasa Nakamura
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan
  - Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
- Kazunori D Yamada
  - Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
  - Graduate School of Information Sciences, Tohoku University, Sendai, Japan
- Kentaro Tomii
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan
  - Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
  - Biotechnology Research Institute for Drug Discovery (BRD), AIST, Tokyo, Japan
  - AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), Tokyo, Japan
- Kazutaka Katoh
  - Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
  - Research Institute for Microbial Diseases, Osaka University, Suita, Japan
9
Deorowicz S, Debudaj-Grabysz A, Gudyś A. FAMSA: Fast and accurate multiple sequence alignment of huge protein families. Sci Rep 2016; 6:33964. [PMID: 27670777 PMCID: PMC5037421 DOI: 10.1038/srep33964] [Citation(s) in RCA: 89] [Impact Index Per Article: 9.9] [Received: 04/05/2016] [Accepted: 08/31/2016] [Indexed: 11/10/2022] [Open Access]
Abstract
Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein families databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilization of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. What matters is that its implementation is highly optimized and parallelized to make the most of modern computer platforms. Thanks to the above, quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT for datasets exceeding a few thousand sequences. Quality does not compromise on time or memory requirements, which are an order of magnitude lower than those in the existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.
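To make the "longest common subsequence measure" mentioned in this abstract concrete, here is a minimal sketch of the textbook dynamic-programming LCS length, used as a pairwise similarity. It is not FAMSA's implementation, which the abstract describes as highly optimized and parallelized. The function names are hypothetical, and dividing by the longer sequence length is an assumption made purely for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row dynamic programming)."""
    previous = [0] * (len(b) + 1)  # LCS lengths for the prefix of a processed so far
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_similarity(a, b):
    """LCS length divided by the longer sequence length (illustrative normalization only)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    sequences = ["MKVLA", "MKLA", "MRVLA"]  # toy protein fragments, not real data
    for i, first in enumerate(sequences):
        for second in sequences[i + 1:]:
            print(first, second, lcs_similarity(first, second))
```

A progressive aligner would typically build its guide tree from pairwise values like these before aligning sequences in tree order; the details of how FAMSA does so are not given in the abstract.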
Affiliation(s)
- Sebastian Deorowicz
  - Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
- Adam Gudyś
  - Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland