1
Miller CN, Jarrell-Hurtado S, Haag MV, Sara Ye Y, Simenc M, Alvarez-Maldonado P, Behnami S, Zhang L, Swift J, Papikian A, Yu J, Colt K, Ecker JR, Michael TP, Law JA, Busch W. A single-nuclei transcriptome census of the Arabidopsis maturing root identifies that MYB67 controls phellem cell maturation. Dev Cell 2025; 60:1377-1391.e7. [PMID: 39793584 DOI: 10.1016/j.devcel.2024.12.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 09/10/2024] [Accepted: 12/11/2024] [Indexed: 01/13/2025]
Abstract
The periderm provides a protective barrier in many seed plant species. The development of the suberized phellem, which forms the outermost layer of this important tissue, has become a trait of interest for enhancing both plant resilience to stresses and plant-mediated CO2 sequestration in soils. Despite its importance, very few genes driving phellem development are known. Employing single-nuclei sequencing, we have generated an expression census capturing the complete developmental progression of Arabidopsis root phellem cells, from their progenitor cell type, the pericycle, through to their maturation. With this, we identify a whole suite of genes underlying this process, including MYB67, which we show has a role in phellem cell maturation. Our expression census and functional discoveries represent a resource, expanding our comprehension of secondary growth in plants. These data can be used to fuel discoveries and engineering efforts relevant to plant resilience and climate change.
Affiliation(s)
- Charlotte N Miller
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Sean Jarrell-Hurtado
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Manisha V Haag
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Y Sara Ye
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Mathew Simenc
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Paloma Alvarez-Maldonado
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Sara Behnami
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Ling Zhang
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Joseph Swift
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Ashot Papikian
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Jingting Yu
  - Integrative Genomics and Bioinformatics Core, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Kelly Colt
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Joseph R Ecker
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
  - Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
  - Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Todd P Michael
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Julie A Law
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
  - Division of Biological Sciences, University of California, San Diego, La Jolla, CA 92093, USA
- Wolfgang Busch
  - Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
2
Um DH, Knowles DA, Kaiser GE. Vector embeddings by sequence similarity and context for improved compression, similarity search, clustering, organization, and manipulation of cDNA libraries. Comput Biol Chem 2025; 114:108251. [PMID: 39602973 DOI: 10.1016/j.compbiolchem.2024.108251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 10/10/2024] [Accepted: 10/11/2024] [Indexed: 11/29/2024]
Abstract
This paper demonstrates the utility of organized numerical representations of genes in research involving flat string gene formats (i.e., FASTA/FASTQ). By assigning a unique vector embedding to each short sequence, it is possible to cluster the string representations of cDNA libraries more efficiently and to improve their compression performance. Furthermore, by studying alternative coordinate vector embeddings trained on the context of codon triplets, we can demonstrate clustering based on amino acid properties. Employing this sequence embedding method to encode barcodes and cDNA sequences, we can improve the time complexity of similarity searches. By pairing vector embeddings with an algorithm that determines the vector proximity in Euclidean space, this approach enables quicker and more flexible sequence searches.
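To make the idea of searching sequences by Euclidean proximity concrete, here is a minimal Python sketch. It is not the authors' method: the paper describes trained embeddings, whereas this example substitutes a plain normalized k-mer frequency vector as a stand-in. The names `embed` and `nearest`, the value `K = 2`, and the toy library are all assumptions made for illustration.

```python
from itertools import product
from math import dist

K = 2  # k-mer length; an illustrative choice, not taken from the paper
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def embed(seq: str) -> list[float]:
    """Map a DNA sequence to a normalized k-mer frequency vector (hypothetical stand-in embedding)."""
    counts = [0.0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        j = INDEX.get(seq[i:i + K])
        if j is not None:  # skip k-mers containing N or other symbols
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def nearest(query: str, library: dict[str, str]) -> str:
    """Return the name of the library sequence whose vector is closest to the query."""
    q = embed(query)
    return min(library, key=lambda name: dist(q, embed(library[name])))


# Toy library; the sequences are made up for the example.
library = {
    "poly_at": "ATATATATATAT",
    "gc_rich": "GCGCGGCCGCGC",
    "mixed": "ACGTACGTACGT",
}
best = nearest("ATATATTATATA", library)
print(best)
```

This linear scan compares the query against every library vector. A practical system would more likely precompute the vectors and use a spatial index, which is where a time-complexity gain of the kind the abstract mentions could come from.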
Affiliation(s)
- Daniel H Um
  - Department of Computer Science, Columbia University, New York, NY, USA
- David A Knowles
  - Department of Computer Science, Columbia University, New York, NY, USA
  - Department of Systems Biology, Columbia University, New York, NY, USA
  - The Data Science Institute, Columbia University, New York, NY, USA
  - New York Genome Center, New York, NY, USA
- Gail E Kaiser
  - Department of Computer Science, Columbia University, New York, NY, USA
3
Du L, Chen J, Sun D, Zhao K, Zeng Q, Yang N. Krait2: a versatile software for microsatellite investigation, visualization and marker development. BMC Genomics 2025; 26:72. [PMID: 39863857 PMCID: PMC11762079 DOI: 10.1186/s12864-025-11252-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2024] [Accepted: 01/16/2025] [Indexed: 01/27/2025] Open
Abstract
BACKGROUND Microsatellites are highly polymorphic repeat sequences that are interspersed throughout almost all genomes and are widely used as powerful molecular markers in diverse fields. Microsatellite expansions play pivotal roles in gene expression regulation and are implicated in various neurological diseases and cancers. Although much effort has been devoted to developing efficient tools for microsatellite identification, there is still a lack of a powerful tool for large-scale microsatellite analysis. RESULTS We present Krait2, a user-friendly graphical tool for investigating perfect, imperfect and compound microsatellites from FASTA and FASTQ formatted genomic datasets. Krait2 not only provides features such as primer design, repeat filtering, repeat annotation and statistical analysis, but also offers various output formats to support customized downstream analysis. Moreover, it is capable of analyzing multiple genomes simultaneously and conducting comparative analysis. CONCLUSIONS Krait2 is versatile, easy-to-use software for both novices and experts to identify and analyze microsatellites. The installer and source code are available at https://github.com/lmdu/krait2.
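As a rough illustration of what identifying a perfect microsatellite involves, the Python sketch below scans a sequence with a back-referencing regular expression. This is not Krait2's algorithm or API. The function name `find_perfect_ssrs` and the thresholds `max_motif=6` and `min_repeats=3` are assumptions chosen for the example, and imperfect and compound repeats are not handled.

```python
import re


def find_perfect_ssrs(seq, max_motif=6, min_repeats=3):
    """Return (start, end, motif, repeats) for perfect tandem repeats.

    Coordinates are 0-based and end-exclusive. The lazy quantifier on the
    motif group makes the regex try the shortest motif first, so "AAAAAA"
    is reported as A x 6 rather than AA x 3.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), match.end(), motif, repeats))
    return hits


# Made-up test sequence: GGT + (CA)x4 + GT + (A)x6 + TC
hits = find_perfect_ssrs("GGTCACACACAGTAAAAAATC")
for start, end, motif, repeats in hits:
    print(f"{start}-{end}: ({motif})x{repeats}")
```

A single repeat threshold for every motif length is a simplification. Tools in this space usually let the minimum repeat count vary with motif length, so that short mononucleotide runs are not reported alongside longer, more informative repeats.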
Affiliation(s)
- Lianming Du
  - Antibiotics Research and Re-Evaluation Key Laboratory of Sichuan Province, School of Pharmacy, Chengdu University, Chengdu, 610106, China
- Jiahao Chen
  - Antibiotics Research and Re-Evaluation Key Laboratory of Sichuan Province, School of Pharmacy, Chengdu University, Chengdu, 610106, China
- Dalin Sun
  - Antibiotics Research and Re-Evaluation Key Laboratory of Sichuan Province, School of Pharmacy, Chengdu University, Chengdu, 610106, China
- Kelei Zhao
  - Antibiotics Research and Re-Evaluation Key Laboratory of Sichuan Province, School of Pharmacy, Chengdu University, Chengdu, 610106, China
- Qianglin Zeng
  - Antibiotics Research and Re-Evaluation Key Laboratory of Sichuan Province, School of Pharmacy, Chengdu University, Chengdu, 610106, China
- Nan Yang
  - Key Laboratory of Qinghai-Tibetan Plateau Animal Genetic Resource Reservation and Utilization, Sichuan Province and Ministry of Education, Southwest Minzu University, Chengdu, 610225, China
4
Reinar WB, Krabberød AK, Lalun VO, Butenko MA, Jakobsen KS. Short tandem repeats delineate gene bodies across eukaryotes. Nat Commun 2024; 15:10902. [PMID: 39738068 PMCID: PMC11686069 DOI: 10.1038/s41467-024-55276-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2024] [Accepted: 12/05/2024] [Indexed: 01/01/2025] Open
Abstract
Short tandem repeats (STRs) have emerged as important and hypermutable sites where genetic variation correlates with gene expression in plant and animal systems. Recently, it has been shown that a broad range of transcription factors (TFs) are affected by STRs near or in the DNA target binding site. Despite this, the distribution of STR motif repetitiveness in eukaryote genomes is still largely unknown. Here, we identify monomer and dimer STR motif repetitiveness in 5.1 billion 10-bp windows upstream of translation starts and downstream of translation stops in 25 million genes spanning 1270 species across the eukaryotic Tree of Life. We report that all surveyed genomes have gene-proximal shifts in motif repetitiveness. Within genomes, variation in gene-proximal repetitiveness landscapes correlated to the function of genes; genes with housekeeping functions were depleted in upstream and downstream repetitiveness. Furthermore, the repetitiveness landscapes correlated with TF binding sites, indicating that gene function has evolved in conjunction with cis-regulatory STRs and TFs that recognize repetitive sites. These results suggest that the hypermutability inherent to STRs is canalized along the genome sequence and contributes to regulatory and eco-evolutionary dynamics in all eukaryotes.
Affiliation(s)
- William B Reinar
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, Oslo, Norway
  - Section for Genetics and Evolutionary Biology, Department of Biosciences, University of Oslo, Oslo, Norway
- Anders K Krabberød
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, Oslo, Norway
  - Section for Genetics and Evolutionary Biology, Department of Biosciences, University of Oslo, Oslo, Norway
- Vilde O Lalun
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, Oslo, Norway
  - Section for Genetics and Evolutionary Biology, Department of Biosciences, University of Oslo, Oslo, Norway
- Melinka A Butenko
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, Oslo, Norway
  - Section for Genetics and Evolutionary Biology, Department of Biosciences, University of Oslo, Oslo, Norway
- Kjetill S Jakobsen
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, Oslo, Norway
5
Shen W, Sipos B, Zhao L. SeqKit2: A Swiss army knife for sequence and alignment processing. IMETA 2024; 3:e191. [PMID: 38898985 PMCID: PMC11183193 DOI: 10.1002/imt2.191] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/23/2024] [Revised: 03/19/2024] [Accepted: 03/20/2024] [Indexed: 06/21/2024]
Abstract
In the era of ubiquitous high-throughput sequencing studies, there is a growing need for analysis tools that are not just performant but also comprehensive and user-friendly enough to cater to both novice and advanced users. This article introduces SeqKit2, the next iteration of the widely used sequence analysis tool SeqKit, featuring expanded functionality, performance optimizations, and support for additional compression methods. Retaining a pragmatic subcommand architecture, SeqKit2 represents a substantial enhancement through the inclusion of 19 additional subcommands, expanding its overall repertoire to a total of 38 in eight categories. The new subcommands add functionality such as amplicon processing and robust, error-tolerant parsing of sequence records. In addition, three subcommands designed for real-time analysis are added for periodic monitoring of properties of FASTQ and Binary Alignment/Map alignment records and real-time streaming from multiple sequence files. The performance of SeqKit2 is benchmarked against the old version of SeqKit, Bioawk, Seqtk, and SeqFu tools. SeqKit2 consistently outperforms its predecessor, albeit with marginally higher memory usage, while maintaining competitive runtimes against other tools. With its broad functionality, proven usability, and ongoing development driven by user feedback, we hope that bioinformaticians will find SeqKit2 useful as a "Swiss army knife" of sequence and alignment processing, equally adept at facilitating ad hoc analyses and seamlessly integrating into larger pipelines.
Affiliation(s)
- Wei Shen
  - Department of Infectious Diseases, Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Botond Sipos
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, UK
- Liuyang Zhao
  - Department of Infectious Diseases, Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
6
Kaplow IM, Lawler AJ, Schäffer DE, Srinivasan C, Sestili HH, Wirthlin ME, Phan BN, Prasad K, Brown AR, Zhang X, Foley K, Genereux DP, Zoonomia Consortium, Karlsson EK, Lindblad-Toh K, Meyer WK, Pfenning AR. Relating enhancer genetic variation across mammals to complex phenotypes using machine learning. Science 2023; 380:eabm7993. [PMID: 37104615 PMCID: PMC10322212 DOI: 10.1126/science.abm7993] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 10/12/2021] [Accepted: 02/23/2023] [Indexed: 04/29/2023]
Abstract
Protein-coding differences between species often fail to explain phenotypic diversity, suggesting the involvement of genomic elements that regulate gene expression such as enhancers. Identifying associations between enhancers and phenotypes is challenging because enhancer activity can be tissue-dependent and functionally conserved despite low sequence conservation. We developed the Tissue-Aware Conservation Inference Toolkit (TACIT) to associate candidate enhancers with species' phenotypes using predictions from machine learning models trained on specific tissues. Applying TACIT to associate motor cortex and parvalbumin-positive interneuron enhancers with neurological phenotypes revealed dozens of enhancer-phenotype associations, including brain size-associated enhancers that interact with genes implicated in microcephaly or macrocephaly. TACIT provides a foundation for identifying enhancers associated with the evolution of any convergently evolved phenotype in any large group of species with aligned genomes.
Affiliation(s)
- Irene M. Kaplow
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- Alyssa J. Lawler
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Daniel E. Schäffer
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Chaitanya Srinivasan
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Heather H. Sestili
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Morgan E. Wirthlin
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- BaDoi N. Phan
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
  - Medical Scientist Training Program, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
- Kavya Prasad
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Ashley R. Brown
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Xiaomeng Zhang
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Kathleen Foley
  - Department of Biological Sciences, Lehigh University, Bethlehem, PA, USA
- Diane P. Genereux
  - Broad Institute, Cambridge, MA, USA
  - Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
- Elinor K. Karlsson
  - Broad Institute, Cambridge, MA, USA
  - Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
- Kerstin Lindblad-Toh
  - Broad Institute, Cambridge, MA, USA
  - Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
- Wynn K. Meyer
  - Department of Biological Sciences, Lehigh University, Bethlehem, PA, USA
- Andreas R. Pfenning
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Biology, Carnegie Mellon University, Pittsburgh, PA, USA
7
Piñeiro C, Pichel JC. BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale. Gigascience 2022; 12:giad062. [PMID: 37522758 PMCID: PMC10388699 DOI: 10.1093/gigascience/giad062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Revised: 05/25/2023] [Accepted: 07/10/2023] [Indexed: 08/01/2023] Open
Abstract
BACKGROUND High-throughput sequencing technologies have led to an unprecedented explosion in the amount of sequencing data available, which are typically stored in FASTA and FASTQ files. The literature offers several tools to process and manipulate those types of files with the aim of transforming sequence data into biological knowledge. However, none of them is well suited to efficiently processing very large files, likely on the order of terabytes in the coming years, since they are based on sequential processing. Only some routines of the well-known seqkit tool are partly parallelized, and even then its scalability is limited to a few threads on a single computing node. RESULTS Our approach, BigSeqKit, takes advantage of a high-performance computing-Big Data framework to parallelize and optimize the commands included in seqkit with the aim of speeding up the manipulation of FASTA/FASTQ files. As a result, in most cases, it is from tens to hundreds of times faster than several state-of-the-art tools. At the same time, our toolkit is easy to use and install on any kind of hardware platform (local server or cluster), and its routines can be used as a bioinformatics library or from the command line. CONCLUSIONS BigSeqKit is a very complete and ultra-fast toolkit to process and manipulate large FASTA and FASTQ files. It is publicly available at https://github.com/citiususc/BigSeqKit.
Affiliation(s)
- César Piñeiro
  - CiTIUS, Universidade de Santiago de Compostela, Santiago de Compostela 15782, Spain
- Juan C Pichel
  - CiTIUS, Universidade de Santiago de Compostela, Santiago de Compostela 15782, Spain
8
Singh U, Wurtele ES. orfipy: a fast and flexible tool for extracting ORFs. Bioinformatics 2021; 37:3019-3020. [PMID: 33576786 PMCID: PMC8479652 DOI: 10.1093/bioinformatics/btab090] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 11/09/2020] [Revised: 12/31/2020] [Accepted: 02/03/2021] [Indexed: 02/02/2023] Open
Abstract
SUMMARY Searching for open reading frames is a routine task and a critical step prior to annotating protein coding regions in newly sequenced genomes or de novo transcriptome assemblies. With the tremendous increase in genomic and transcriptomic data, faster tools are needed to handle large input datasets. These tools should be versatile enough to fine-tune search criteria and allow efficient downstream analysis. Here we present a new Python-based tool, orfipy, which allows the user to flexibly search for open reading frames in genomic and transcriptomic sequences. The search is rapid and fully customizable, with a choice of FASTA and BED output formats. AVAILABILITY AND IMPLEMENTATION orfipy is implemented in Python and is compatible with Python v3.6 and higher. Source code: https://github.com/urmi-21/orfipy. Installation: from the source, or via PyPI (https://pypi.org/project/orfipy) or bioconda (https://anaconda.org/bioconda/orfipy). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
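For readers unfamiliar with the task itself, the following Python sketch shows a bare-bones open reading frame search. It does not use or reproduce orfipy's API or implementation. The function name `find_orfs`, the `min_len` parameter, and the restriction to ATG start codons on the forward strand are simplifying assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=9):
    """Return sorted (start, end, frame) tuples for ATG...stop ORFs.

    Only the forward strand is scanned. Coordinates are 0-based and
    end-exclusive, and the reported span includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None
    return sorted(orfs)


# Made-up sequence containing two short ORFs in different frames.
example = "CCATGAAATGGTAACATGCCCTAGG"
for start, end, frame in find_orfs(example):
    print(frame, start, end, example[start:end])
```

Covering the reverse strand would mean running the same scan on the reverse complement and mapping the coordinates back. Alternative start codons and ORFs that run off the end of a sequence are the kind of search criteria the abstract describes as customizable.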
Affiliation(s)
- Urminder Singh
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA
  - Center for Metabolic Biology, Iowa State University, Ames, IA 50011, USA
  - Department of Genetics Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
- Eve Syrkin Wurtele
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA
  - Center for Metabolic Biology, Iowa State University, Ames, IA 50011, USA
  - Department of Genetics Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
9
Predicting Hot Spot Residues at Protein-DNA Binding Interfaces Based on Sequence Information. Interdiscip Sci 2020; 13:1-11. [PMID: 33068261 DOI: 10.1007/s12539-020-00399-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2020] [Revised: 09/27/2020] [Accepted: 10/01/2020] [Indexed: 10/23/2022]
Abstract
Hot spot residues at protein-DNA binding interfaces are hugely important for investigating the underlying mechanism of molecular recognition. Currently, only a few tools are available for identifying hot spot residues in protein-DNA complexes, and these tools require three-dimensional protein structures. However, it is well known that three-dimensional structures are unavailable for most proteins. Considering this limitation, we proposed a method, named SPDH, for predicting hot spot residues based only on protein sequences. First, we obtained 133 features from physicochemical properties, conservation, predicted solvent-accessible surface area, and structure. Then, we systematically assessed these features using various feature selection methods to obtain the optimal feature subset and compared models built with four classical machine learning algorithms (support vector machine, random forest, logistic regression, and k-nearest neighbor) on the training dataset. We found that the variability of physicochemical property features between wild and mutant types was important for improving the performance of the prediction model. On the independent test set, our method achieved an AUC of 0.760 and a sensitivity of 0.808, and outperformed other methods. The data and source code can be downloaded at https://github.com/xialab-ahu/SPDH.