1
De Paolis E, Nero C, Micarelli E, Leoni G, Piermattei A, Trozzi R, Scarselli E, D'Alise AM, Giacò L, De Bonis M, Preziosi A, Daniele G, Piana D, Pasciuto T, Zannoni G, Minucci A, Scambia G, Urbani A, Fanfani F. Characterization of shared neoantigens landscape in Mismatch Repair Deficient Endometrial Cancer. NPJ Precis Oncol 2024; 8:283. [PMID: 39706858 DOI: 10.1038/s41698-024-00779-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Accepted: 12/11/2024] [Indexed: 12/23/2024] Open
Abstract
Endometrial cancer (EC) with Mismatch Repair deficiency (MMRd) is characterized by the accumulation of insertions/deletions at microsatellite sites. These mutations lead to the synthesis of frameshift peptides (FSPs), tumor-specific neoantigens (nAgs) that have been shown to be shared across patients and tumors with MMRd. In this study, we explored the feasibility of a nAg-based cancer vaccination design in EC with MMRd. We adopted a whole exome sequencing approach and ad hoc bioinformatics pipelines to characterize FSPs in 35 patients with EC. A mean of 146 mutated mononucleotide repeats (MNRs) was identified, with enrichment in the group of patients with MLH1 impairment. Comparative analysis of the EC FSPs against the content of the previously validated NOUS-209 vaccine showed high coverage. Ribo-seq provided evidence that the FSPs are translated into expressed proteins, supporting their potential as vaccination targets. The development of a nAg-based vaccine strategy in MMRd EC may be further explored.
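To make the frameshift mechanism in this abstract concrete, the following minimal sketch shows how a one-base deletion inside a mononucleotide repeat changes the reading frame and yields a novel peptide tail. It is an illustration only, not the authors' pipeline: the toy coding sequence and the helper names (`translate`, `delete_one_base`, `novel_tail`) are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, built in TCAG order (last base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate a coding sequence from position 0 until the first stop codon."""
    peptide = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


def delete_one_base(seq, position):
    """Return seq with the base at `position` removed (a 1-bp deletion)."""
    return seq[:position] + seq[position + 1:]


def novel_tail(wild_pep, mutant_pep):
    """Return the part of mutant_pep that follows its common prefix with wild_pep."""
    i = 0
    while i < min(len(wild_pep), len(mutant_pep)) and wild_pep[i] == mutant_pep[i]:
        i += 1
    return mutant_pep[i:]


# Hypothetical toy CDS: ATG GCC, then an A8 mononucleotide repeat, then a tail.
wild_cds = "ATGGCC" + "AAAAAAAA" + "GGATCTGACTTAA"
mutant_cds = delete_one_base(wild_cds, 6)  # lose one A from the repeat

wild_pep = translate(wild_cds)
mutant_pep = translate(mutant_cds)
frameshift_tail = novel_tail(wild_pep, mutant_pep)
```

In this toy case the deletion shifts the frame after the shared prefix, so the mutant peptide ends in residues that do not occur in the wild-type product; tails of this kind are what the abstract calls frameshift peptides.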
Affiliation(s)
- Elisa De Paolis
  - Departmental Unit of Molecular and Genomic Diagnostics, Genomics Research Core Facility, Gemelli Science and Technology Park (GSTeP), Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
  - Clinical Chemistry, Biochemistry and Molecular Biology Operations (UOC), Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
- Camilla Nero
  - Department of Woman and Child's Health and Public Health Sciences, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
  - Catholic University of Sacred Heart, Rome, Italy
- Alessia Piermattei
  - Pathology Unit, Department of Woman and Child's Health and Public Health Sciences, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
- Rita Trozzi
  - Department of Woman and Child's Health and Public Health Sciences, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
  - Catholic University of Sacred Heart, Rome, Italy
- Luciano Giacò
  - Bioinformatics Research Core Facility, Gemelli Science and Technology Park (GSTeP), Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
- Maria De Bonis
  - Departmental Unit of Molecular and Genomic Diagnostics, Genomics Research Core Facility, Gemelli Science and Technology Park (GSTeP), Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
- Alessia Preziosi
  - Bioinformatics Research Core Facility, Gemelli Science and Technology Park (GSTeP), Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
- Gennaro Daniele
  - Phase 1 Unit, Fondazione Policlinico Universitario Agostino Gemelli, IRCCS, Rome, Italy; Scientific Directorate, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
- Diletta Piana
  - Department of Basic Biotechnological Sciences, Intensivological and Perioperative Clinics, Catholic University of Sacred Heart, Rome, Italy
- Tina Pasciuto
  - Research Core Facility Data Collection, Gemelli Science and Technology Park (GSTeP), Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
  - Section of Hygiene, University Department of Life Sciences and Public Health, Catholic University of Sacred Heart, Rome, Italy
- Gianfranco Zannoni
  - Pathology Unit, Department of Woman and Child's Health and Public Health Sciences, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
  - Pathology Institute, Catholic University of Sacred Heart, Rome, Italy
- Angelo Minucci
  - Departmental Unit of Molecular and Genomic Diagnostics, Genomics Research Core Facility, Gemelli Science and Technology Park (GSTeP), Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
- Giovanni Scambia
  - Department of Woman and Child's Health and Public Health Sciences, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
  - Catholic University of Sacred Heart, Rome, Italy
- Andrea Urbani
  - Clinical Chemistry, Biochemistry and Molecular Biology Operations (UOC), Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
  - Department of Basic Biotechnological Sciences, Intensivological and Perioperative Clinics, Catholic University of Sacred Heart, Rome, Italy
- Francesco Fanfani
  - Department of Woman and Child's Health and Public Health Sciences, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
  - Catholic University of Sacred Heart, Rome, Italy

2
Clauwaert J, McVey Z, Gupta R, Yannuzzi I, Menschaert G, Prensner JR. Deep learning to decode sites of RNA translation in normal and cancerous tissues. bioRxiv 2024:2024.03.21.586110. [PMID: 38585907 PMCID: PMC10996544 DOI: 10.1101/2024.03.21.586110] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/09/2024]
Abstract
The biological process of RNA translation is fundamental to cellular life and has wide-ranging implications for human disease. Yet, accurately delineating the variation in RNA translation represents a significant challenge. Here, we develop RiboTIE, a transformer model-based approach to map global RNA translation. We find that RiboTIE offers unparalleled precision and sensitivity for ribosome profiling data. Application of RiboTIE to normal brain and medulloblastoma cancer samples enables high-resolution insights into disease regulation of RNA translation.
Affiliation(s)
- Jim Clauwaert
  - Division of Pediatric Hematology/Oncology, Department of Pediatrics, University of Michigan, Ann Arbor, MI, USA
  - Chad Carr Pediatric Brain Tumor Center, University of Michigan, Ann Arbor, MI, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
  - These authors are corresponding authors: Jim Clauwaert, Gerben Menschaert, John R. Prensner
- Zahra McVey
  - Novo Nordisk Research Centre Oxford, Novo Nordisk Ltd., Oxford, United Kingdom
- Ramneek Gupta
  - Novo Nordisk Research Centre Oxford, Novo Nordisk Ltd., Oxford, United Kingdom
- Ian Yannuzzi
  - Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Gerben Menschaert
  - Department of Data Analysis and Mathematical Modelling, Ghent University, Belgium
  - These authors share senior authorship: Gerben Menschaert, John R. Prensner
  - These authors are corresponding authors: Jim Clauwaert, Gerben Menschaert, John R. Prensner
- John R. Prensner
  - Division of Pediatric Hematology/Oncology, Department of Pediatrics, University of Michigan, Ann Arbor, MI, USA
  - Chad Carr Pediatric Brain Tumor Center, University of Michigan, Ann Arbor, MI, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
  - These authors share senior authorship: Gerben Menschaert, John R. Prensner
  - These authors are corresponding authors: Jim Clauwaert, Gerben Menschaert, John R. Prensner

3
Sng CCT, Kallor AA, Simpson BS, Bedran G, Alfaro J, Litchfield K. Untranslated regions (UTRs) are a potential novel source of neoantigens for personalised immunotherapy. Front Immunol 2024; 15:1347542. [PMID: 38558815 PMCID: PMC10978585 DOI: 10.3389/fimmu.2024.1347542] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Accepted: 02/19/2024] [Indexed: 04/04/2024] Open
Abstract
Background: Neoantigens, mutated tumour-specific antigens, are key targets of anti-tumour immunity during checkpoint inhibitor (CPI) treatment. Their identification is fundamental to designing neoantigen-directed therapy. Non-canonical neoantigens arising from the untranslated regions (UTR) of the genome are an overlooked source of immunogenic neoantigens. Here, we describe the landscape of UTR-derived neoantigens and release a computational tool, PrimeCUTR, to predict UTR neoantigens generated by start-gain and stop-loss mutations.

Methods: We applied PrimeCUTR to a whole genome sequencing dataset of pre-treatment tumour samples from CPI-treated patients (n = 341). Cancer immunopeptidomic datasets were interrogated to identify MHC class I presentation of UTR neoantigens.

Results: Start-gain neoantigens were predicted in 72.7% of patients, while stop-loss mutations were found in 19.3% of patients. While UTR neoantigens accounted for only 2.6% of total predicted neoantigen burden, they contributed 12.4% of neoantigens with high dissimilarity to self-proteome. More start-gain neoantigens were found in CPI responders, but this relationship was not significant when correcting for tumour mutational burden. While most UTR neoantigens are private, we identified two recurrent start-gain mutations in melanoma. Using immunopeptidomic datasets, we identify two distinct MHC class I-presented UTR neoantigens: one from a recurrent start-gain mutation in melanoma, and one private to Jurkat cells.

Conclusion: PrimeCUTR is a novel tool which complements existing neoantigen discovery approaches and has potential to increase the detection yield of neoantigens in personalised therapeutics, particularly for neoantigens with high dissimilarity to self. Further studies are warranted to confirm the expression and immunogenicity of UTR neoantigens.
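As an illustration of what a start-gain mutation is, the sketch below checks whether a single-base substitution in a 5' UTR creates a new ATG that was absent from the reference. This is not PrimeCUTR's implementation, which the abstract does not describe in detail; the function names and the toy sequence are hypothetical, and the sketch assumes `utr5` ends immediately before the annotated start codon.

```python
def atg_positions(seq):
    """Return the set of 0-based positions at which an ATG begins."""
    return {i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"}


def start_gains(utr5, pos, alt):
    """List new ATGs created by substituting `alt` at `pos` in a 5' UTR.

    Each result is (atg_position, in_frame), where in_frame is True when the
    new ATG lies a multiple of three bases upstream of the annotated start.
    Assumption: utr5 ends immediately before the annotated start codon.
    Intervening stop codons are not checked in this sketch.
    """
    mutated = utr5[:pos] + alt + utr5[pos + 1:]
    new_starts = sorted(atg_positions(mutated) - atg_positions(utr5))
    return [(s, (len(utr5) - s) % 3 == 0) for s in new_starts]


# Hypothetical 12-base 5' UTR with no ATG in the reference.
utr5 = "GGCACGTTCAAG"
gain = start_gains(utr5, 4, "T")     # C>T at index 4 turns ACG into ATG
no_gain = start_gains(utr5, 0, "A")  # a substitution that creates no ATG
```

A real predictor would also need to translate from the new start to the next stop codon and score the resulting peptides for MHC binding; those steps are deliberately left out here.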
Affiliation(s)
- Christopher C. T. Sng
  - Cancer Research UK Lung Cancer Centre of Excellence, University College London (UCL) Cancer Institute, London, United Kingdom
- Ashwin Adrian Kallor
  - International Center for Cancer Vaccine Science, University of Gdansk, Gdansk, Poland
- Benjamin S. Simpson
  - Cancer Research UK Lung Cancer Centre of Excellence, University College London (UCL) Cancer Institute, London, United Kingdom
- Georges Bedran
  - International Center for Cancer Vaccine Science, University of Gdansk, Gdansk, Poland
- Javier Alfaro
  - International Center for Cancer Vaccine Science, University of Gdansk, Gdansk, Poland
  - Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, Canada
  - Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
- Kevin Litchfield
  - Cancer Research UK Lung Cancer Centre of Excellence, University College London (UCL) Cancer Institute, London, United Kingdom

4
Le NQK. Leveraging transformers-based language models in proteome bioinformatics. Proteomics 2023; 23:e2300011. [PMID: 37381841 DOI: 10.1002/pmic.202300011] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 05/14/2023] [Revised: 06/13/2023] [Accepted: 06/13/2023] [Indexed: 06/30/2023]
Abstract
In recent years, the rapid growth of biological data has increased interest in using bioinformatics to analyze and interpret this data. Proteomics, which studies the structure, function, and interactions of proteins, is a crucial area of bioinformatics. Using natural language processing (NLP) techniques in proteomics is an emerging field that combines machine learning and text mining to analyze biological data. Recently, transformer-based NLP models have gained significant attention for their ability to process variable-length input sequences in parallel, using self-attention mechanisms to capture long-range dependencies. In this review paper, we discuss the recent advancements in transformer-based NLP models in proteome bioinformatics and examine their advantages, limitations, and potential applications to improve the accuracy and efficiency of various tasks. Additionally, we highlight the challenges and future directions of using these models in proteome bioinformatics research. Overall, this review provides valuable insights into the potential of transformer-based NLP models to revolutionize proteome bioinformatics.
Affiliation(s)
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
  - AIBioMed Research Group, Taipei Medical University, Taipei, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, Taipei, Taiwan

5
McGibbon M, Shave S, Dong J, Gao Y, Houston DR, Xie J, Yang Y, Schwaller P, Blay V. From intuition to AI: evolution of small molecule representations in drug discovery. Brief Bioinform 2023; 25:bbad422. [PMID: 38033290 PMCID: PMC10689004 DOI: 10.1093/bib/bbad422] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 09/01/2023] [Revised: 10/13/2023] [Accepted: 11/01/2023] [Indexed: 12/02/2023] Open
Abstract
Within drug discovery, the goal of AI scientists and cheminformaticians is to help identify molecular starting points that will develop into safe and efficacious drugs while reducing costs, time and failure rates. To achieve this goal, it is crucial to represent molecules in a digital format that makes them machine-readable and facilitates the accurate prediction of properties that drive decision-making. Over the years, molecular representations have evolved from intuitive and human-readable formats to bespoke numerical descriptors and fingerprints, and now to learned representations that capture patterns and salient features across vast chemical spaces. Among these, sequence-based and graph-based representations of small molecules have become highly popular. However, each approach has strengths and weaknesses across dimensions such as generality, computational cost, inversibility for generative applications and interpretability, which can be critical in informing practitioners' decisions. As the drug discovery landscape evolves, opportunities for innovation continue to emerge. These include the creation of molecular representations for high-value, low-data regimes, the distillation of broader biological and chemical knowledge into novel learned representations and the modeling of up-and-coming therapeutic modalities.
Affiliation(s)
- Miles McGibbon
  - Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
- Steven Shave
  - Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
- Jie Dong
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410013, China
- Yumiao Gao
  - Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
- Douglas R Houston
  - Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
- Jiancong Xie
  - Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-Sen University, Guangzhou, 510000, China
- Yuedong Yang
  - Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-Sen University, Guangzhou, 510000, China
- Philippe Schwaller
  - Laboratory of Artificial Chemical Intelligence (LIAC), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Vincent Blay
  - Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom

6
Prensner JR, Abelin JG, Kok LW, Clauser KR, Mudge JM, Ruiz-Orera J, Bassani-Sternberg M, Moritz RL, Deutsch EW, van Heesch S. What Can Ribo-Seq, Immunopeptidomics, and Proteomics Tell Us About the Noncanonical Proteome? Mol Cell Proteomics 2023; 22:100631. [PMID: 37572790 PMCID: PMC10506109 DOI: 10.1016/j.mcpro.2023.100631] [Citation(s) in RCA: 38] [Impact Index Per Article: 19.0] [Received: 05/14/2023] [Revised: 07/21/2023] [Accepted: 08/08/2023] [Indexed: 08/14/2023] Open
Abstract
Ribosome profiling (Ribo-Seq) has proven transformative for our understanding of the human genome and proteome by illuminating thousands of noncanonical sites of ribosome translation outside the currently annotated coding sequences (CDSs). A conservative estimate suggests that at least 7000 noncanonical ORFs are translated, which, at first glance, has the potential to expand the number of human protein CDSs by 30%, from ∼19,500 annotated CDSs to over 26,000 annotated CDSs. Yet, additional scrutiny of these ORFs has raised numerous questions about what fraction of them truly produce a protein product and what fraction of those can be understood as proteins according to conventional understanding of the term. Adding further complication is the fact that published estimates of noncanonical ORFs vary widely by around 30-fold, from several thousand to several hundred thousand. The summation of this research has left the genomics and proteomics communities both excited by the prospect of new coding regions in the human genome but searching for guidance on how to proceed. Here, we discuss the current state of noncanonical ORF research, databases, and interpretation, focusing on how to assess whether a given ORF can be said to be "protein coding."
Affiliation(s)
- John R Prensner
  - Division of Pediatric Hematology/Oncology, Department of Pediatrics, University of Michigan Medical School, Ann Arbor, Michigan, USA; Department of Biological Chemistry, University of Michigan Medical School, Ann Arbor, Michigan, USA
- Leron W Kok
  - Princess Máxima Center for Pediatric Oncology, Utrecht, The Netherlands
- Karl R Clauser
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Jonathan M Mudge
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Cambridge, UK
- Jorge Ruiz-Orera
  - Cardiovascular and Metabolic Sciences, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany
- Michal Bassani-Sternberg
  - Ludwig Institute for Cancer Research, Agora Center Bugnon 25A, University of Lausanne, Lausanne, Switzerland; Department of Oncology, Centre Hospitalier Universitaire Vaudois (CHUV), Lausanne, Switzerland; Agora Cancer Research Centre, Lausanne, Switzerland
- Robert L Moritz
  - Institute for Systems Biology (ISB), Seattle, Washington, USA
- Eric W Deutsch
  - Institute for Systems Biology (ISB), Seattle, Washington, USA

7
Prensner JR, Abelin JG, Kok LW, Clauser KR, Mudge JM, Ruiz-Orera J, Bassani-Sternberg M, Deutsch EW, van Heesch S. What can Ribo-seq and proteomics tell us about the non-canonical proteome? bioRxiv 2023:2023.05.16.541049. [PMID: 37292611 PMCID: PMC10245706 DOI: 10.1101/2023.05.16.541049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Ribosome profiling (Ribo-seq) has proven transformative for our understanding of the human genome and proteome by illuminating thousands of non-canonical sites of ribosome translation outside of the currently annotated coding sequences (CDSs). A conservative estimate suggests that at least 7,000 non-canonical open reading frames (ORFs) are translated, which, at first glance, has the potential to expand the number of human protein-coding sequences by 30%, from ∼19,500 annotated CDSs to over 26,000. Yet, additional scrutiny of these ORFs has raised numerous questions about what fraction of them truly produce a protein product and what fraction of those can be understood as proteins according to conventional understanding of the term. Adding further complication is the fact that published estimates of non-canonical ORFs vary widely by around 30-fold, from several thousand to several hundred thousand. The summation of this research has left the genomics and proteomics communities both excited by the prospect of new coding regions in the human genome, but searching for guidance on how to proceed. Here, we discuss the current state of non-canonical ORF research, databases, and interpretation, focusing on how to assess whether a given ORF can be said to be "protein-coding".

In brief: The human genome encodes thousands of non-canonical open reading frames (ORFs) in addition to protein-coding genes. As a nascent field, many questions remain regarding non-canonical ORFs. How many exist? Do they encode proteins? What level of evidence is needed for their verification? Central to these debates has been the advent of ribosome profiling (Ribo-seq) as a method to discern genome-wide ribosome occupancy, and immunopeptidomics as a method to detect peptides that are processed and presented by MHC molecules and not observed in traditional proteomics experiments. This article provides a synthesis of the current state of non-canonical ORF research and proposes standards for their future investigation and reporting.

Highlights:
- Combined use of Ribo-seq and proteomics-based methods enables optimal confidence in detecting non-canonical ORFs and their protein products.
- Ribo-seq can provide more sensitive detection of non-canonical ORFs, but data quality and analytical pipelines will impact results.
- Non-canonical ORF catalogs are diverse and span both high-stringency and low-stringency ORF nominations.
- A framework for standardized non-canonical ORF evidence will advance the research field.
Affiliation(s)
- John R. Prensner
  - Department of Pediatrics, Division of Pediatric Hematology/Oncology, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Leron W. Kok
  - Princess Máxima Center for Pediatric Oncology, Heidelberglaan 25, 3584 CS, Utrecht, the Netherlands
- Karl R. Clauser
  - Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Jonathan M. Mudge
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Jorge Ruiz-Orera
  - Cardiovascular and Metabolic Sciences, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), 13125 Berlin, Germany
- Michal Bassani-Sternberg
  - Ludwig Institute for Cancer Research, University of Lausanne, Agora Center Bugnon 25A, 1005 Lausanne, Switzerland
  - Department of Oncology, Centre hospitalier universitaire vaudois (CHUV), Rue du Bugnon 46, 1005 Lausanne, Switzerland
  - Agora Cancer Research Centre, 1011 Lausanne, Switzerland
- Eric W. Deutsch
  - Institute for Systems Biology (ISB), Seattle, Washington 98109, USA
- Sebastiaan van Heesch
  - Princess Máxima Center for Pediatric Oncology, Heidelberglaan 25, 3584 CS, Utrecht, the Netherlands

8
Chothani S, Ho L, Schafer S, Rackham O. Discovering microproteins: making the most of ribosome profiling data. RNA Biol 2023; 20:943-954. [PMID: 38013207 PMCID: PMC10730196 DOI: 10.1080/15476286.2023.2279845] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/30/2023] [Indexed: 11/29/2023] Open
Abstract
Building a reference set of protein-coding open reading frames (ORFs) has revolutionized biological process discovery and understanding. Traditionally, gene models have been confirmed using cDNA sequencing and encoded translated regions inferred using sequence-based detection of start and stop combinations longer than 100 amino-acids to prevent false positives. This has led to small ORFs (smORFs) and their encoded proteins left un-annotated. Ribo-seq allows deciphering translated regions from untranslated irrespective of the length. In this review, we describe the power of Ribo-seq data in detection of smORFs while discussing the major challenge posed by data-quality, -depth and -sparseness in identifying the start and end of smORF translation. In particular, we outline smORF cataloguing efforts in humans and the large differences that have arisen due to variation in data, methods and assumptions. Although current versions of smORF reference sets can already be used as a powerful tool for hypothesis generation, we recommend that future editions should consider these data limitations and adopt unified processing for the community to establish a canonical catalogue of translated smORFs.
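The length-threshold problem this abstract describes can be sketched with a naive sequence-based ORF scan: any ATG followed by an in-frame stop is a candidate, and a minimum-length cutoff decides what gets reported. This is an illustrative toy, not a method from the review; `find_orfs`, its return format, and the example sequence are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_aa):
    """Scan the forward strand for ATG...stop ORFs of at least min_aa residues.

    Returns (start, end, n_aa) tuples, where `end` is the exclusive end of the
    stop codon and n_aa counts the codons before the stop. Every ATG is treated
    as a candidate start, so nested ORFs sharing a stop are all reported.
    """
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk downstream codon by codon until the first in-frame stop.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                n_aa = (pos - start) // 3
                if n_aa >= min_aa:
                    orfs.append((start, pos + 3, n_aa))
                break
    return orfs


# Hypothetical transcript fragment containing one 4-codon smORF.
transcript = "CC" + "ATGGCTAAATTT" + "TGA" + "CC"

with_cutoff = find_orfs(transcript, min_aa=100)  # conventional length cutoff
without_cutoff = find_orfs(transcript, min_aa=1)  # cutoff effectively removed
```

With the 100-residue cutoff the short ORF is discarded, which is the annotation gap the abstract attributes to length-based filtering. Dropping the cutoff recovers it, but on real transcripts it would also admit many spurious candidates, which is why experimental evidence such as Ribo-seq is needed to tell translated smORFs from chance start/stop pairs.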
Affiliation(s)
- Sonia Chothani
  - Program in Cardiovascular and Metabolic Disorders, Duke-National University of Singapore, Singapore
- Lena Ho
  - Program in Cardiovascular and Metabolic Disorders, Duke-National University of Singapore, Singapore
- Sebastian Schafer
  - Program in Cardiovascular and Metabolic Disorders, Duke-National University of Singapore, Singapore
- Owen Rackham
  - Program in Cardiovascular and Metabolic Disorders, Duke-National University of Singapore, Singapore
  - School of Biological Sciences, University of Southampton, Southampton, UK
  - The Alan Turing Institute, The British Library, London, UK