1
Zeng J, Song K, Wang J, Wen H, Zhou J, Ni T, Lu H, Yu Y. Characterization and optimization of 5´ untranslated region containing poly-adenine tracts in Kluyveromyces marxianus using machine-learning model. Microb Cell Fact 2024; 23:7. [PMID: 38172836 PMCID: PMC10763412 DOI: 10.1186/s12934-023-02271-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2023] [Accepted: 12/12/2023] [Indexed: 01/05/2024]
Abstract
BACKGROUND The 5´ untranslated region (5´ UTR) plays a key role in regulating translation efficiency and mRNA stability, making it a favored target in genetic engineering and synthetic biology. A common feature found in the 5´ UTR is the poly-adenine (poly(A)) tract. However, the effect of 5´ UTR poly(A) on protein production remains controversial. Machine-learning models are powerful tools for explaining the complex contributions of features, but models incorporating features of 5´ UTR poly(A) are currently lacking. Thus, our goal is to construct such a model, using natural 5´ UTRs from Kluyveromyces marxianus, a promising cell factory for producing heterologous proteins.
RESULTS We constructed a mini-library consisting of 207 5´ UTRs harboring poly(A) and 34 5´ UTRs without poly(A) from K. marxianus. The effects of each 5´ UTR on the production of a GFP reporter were evaluated individually in vivo, and the resulting protein abundance spanned an approximately 450-fold range throughout. The data were used to train a multi-layer perceptron neural network (MLP-NN) model that incorporated the length and position of poly(A) as features. The model exhibited good performance in predicting protein abundance (average R² = 0.7290). The model suggests that the length of poly(A) is negatively correlated with protein production, whereas poly(A) located between 10 and 30 nt upstream of the start codon (AUG) exhibits a weak positive effect on protein abundance. Using the model as guidance, the deletion or reduction of poly(A) upstream of 30 nt preceding AUG tended to improve the production of GFP and a feruloyl esterase. Deletions of poly(A) showed inconsistent effects on mRNA levels, suggesting that poly(A) represses protein production either with or without reducing mRNA levels.
CONCLUSION The effects of poly(A) on protein production depend on its length and position. Integrating poly(A) features into machine-learning models improves simulation accuracy. Deleting or reducing poly(A) upstream of 30 nt preceding AUG tends to enhance protein production. This optimization strategy can be applied to enhance the yield of K. marxianus and other microbial cell factories.
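The two poly(A) features named in the abstract, tract length and position relative to AUG, can be illustrated with a small extraction routine. This is a minimal sketch and not the authors' pipeline: the function names `poly_a_tracts` and `summarize`, the minimum run length of 5 adenines used to call a tract, and the convention that the base immediately upstream of AUG sits at distance 1 are all assumptions made for illustration. Only the 10 to 30 nt window comes from the abstract.

```python
import re
from typing import Dict, List, Tuple


def poly_a_tracts(utr: str, min_len: int = 5) -> List[Tuple[int, int]]:
    """Return (length, distance_to_AUG) for each poly(A) tract in a 5' UTR.

    `utr` is the sequence directly upstream of the start codon, written
    5'->3', so its last base is 1 nt upstream of AUG. `min_len` is a
    hypothetical threshold for calling a run of A's a tract.
    """
    seq = utr.upper()
    tracts = []
    for match in re.finditer(r"A{%d,}" % min_len, seq):
        length = match.end() - match.start()
        # Distance from the tract's 3'-most A to AUG (adjacent base = 1).
        distance = len(seq) - match.end() + 1
        tracts.append((length, distance))
    return tracts


def summarize(utr: str, min_len: int = 5,
              window: Tuple[int, int] = (10, 30)) -> Dict[str, object]:
    """Collapse the tract list into a few model-ready features."""
    tracts = poly_a_tracts(utr, min_len)
    longest = max((length for length, _ in tracts), default=0)
    in_window = any(window[0] <= dist <= window[1] for _, dist in tracts)
    return {"n_tracts": len(tracts), "longest": longest, "in_window": in_window}


# An 8-nt tract whose 3' end lies 21 nt upstream of AUG.
example = "GC" + "A" * 8 + "C" * 20
print(poly_a_tracts(example))  # [(8, 21)]
print(summarize(example))      # {'n_tracts': 1, 'longest': 8, 'in_window': True}
```

Features of this kind could be concatenated with other sequence descriptors and passed to any regressor; the abstract does not specify the encoding that was actually used.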
Affiliation(s)
- Junyuan Zeng
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
  - Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
- Kunfeng Song
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
  - Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
- Jingqi Wang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
  - Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
- Haimei Wen
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
  - Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
- Jungang Zhou
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
  - Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
- Ting Ni
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
  - Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
- Hong Lu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
  - Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
- Yao Yu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
  - Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
2
Wang H, Zheng H, Wang C, Lu X, Zhao X, Li X. Insight into HOTAIR structural features and functions as landing pads for transcription regulation proteins. Biochem Biophys Res Commun 2017; 485:679-685. [PMID: 28235488 DOI: 10.1016/j.bbrc.2017.02.100] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 02/16/2017] [Accepted: 02/19/2017] [Indexed: 10/20/2022]
Abstract
LncRNAs fulfill a wide range of regulatory functions in almost every process of gene expression. Owing to their secondary structural features, lncRNAs may function as landing pads for transcription factors (TFs). In this paper, we determined the global structural landscape of 20,338 lncRNAs using a minimum free energy (MFE) algorithm, and identified the interactions between lncRNAs and TFs to analyze molecular associations induced by lncRNA structure. The accessibility analysis of full sequences as well as potential TF-binding fragments shows a large percentage of structural flanking sequence around the TF binding sites. This investigation paid particular attention to the high-order architecture of the HOTAIR lncRNA, and identified two coincident modular domains covering fragments 171-410 bp and 811-1520 bp via RNA-TF association prediction and in-silico computational mining. These structural domains were then implicated as potential landing pads that recruit regulatory proteins (13 TFs) and mediate coordinated regulation of transcription. Pathway and disease enrichment analysis illustrated that the interacting TFs are significantly pan-cancer relevant, which is consistent with the known function of HOTAIR. Overall, the in-depth understanding of HOTAIR structure provides the first glimpse of coordinated regulation driven by modular features. The detailed architectural context could yield broad biological insights and provides a framework for comprehending lncRNA structure-function interrelationships.
Affiliation(s)
- Hong Wang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Hewei Zheng
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Chenguang Wang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Xiaoyan Lu
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Xueying Zhao
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Xia Li
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
3
Abstract
Current microarray technologies to determine RNA structure or measure protein–RNA interactions rely on single-stranded, unstructured RNA probes on a chip covering together all k-mers. Since space on the array is limited, the problem is to efficiently design a compact library of unstructured ℓ-long RNA probes, where each k-mer is covered at least p times. Ray et al. designed such a library for specific values of k, ℓ, and p using ad-hoc rules. To our knowledge, there is no general method to date to solve this problem. Here, we address the problem of finding a minimum-size covering of all k-mers by ℓ-long sequences with the desired properties for any value of k, ℓ, and p. As we prove that the problem is NP-hard, we give two solutions: the first is a greedy algorithm with a logarithmic approximation ratio; the second, a heuristic greedy approach based on random walks in de Bruijn graphs. The heuristic algorithm works well in practice and produces a library of unstructured RNA probes that is only ∼1.1-times greater in size compared to the theoretical lower bound. We present results for typical values of k and probe lengths ℓ and show that our algorithm generates a library that is significantly smaller than the library of Ray et al.; moreover, we show that our algorithm outperforms naive methods. Our approach can be generalized and extended to generate RNA or DNA oligo libraries with other desired properties. The software is freely available online.
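The covering problem described in this abstract can be made concrete on a toy instance. The sketch below is not the authors' software. It applies the textbook greedy set-multicover rule (repeatedly pick the candidate that covers the most k-mers still short of p), which is the kind of rule that carries a logarithmic approximation guarantee; whether it matches the published algorithm in detail is an assumption. It also enumerates every ℓ-mer, which is feasible only for very small ℓ, ignores the requirement that probes be unstructured, and assumes that a probe counts at most once toward the coverage of a given k-mer. The names `kmers_of`, `lower_bound`, and `greedy_cover` are hypothetical.

```python
from itertools import product
from math import ceil

ALPHABET = "ACGU"


def kmers_of(seq, k):
    """Set of distinct k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def lower_bound(k, ell, p=1):
    """Counting bound: a probe holds at most ell - k + 1 distinct k-mers."""
    return ceil(p * len(ALPHABET) ** k / (ell - k + 1))


def greedy_cover(k, ell, p=1):
    """Greedily pick ell-mers until every k-mer is covered by >= p probes."""
    # Remaining coverage still required for each k-mer.
    need = {"".join(t): p for t in product(ALPHABET, repeat=k)}
    candidates = ["".join(t) for t in product(ALPHABET, repeat=ell)]
    contents = {c: kmers_of(c, k) for c in candidates}
    library = []

    def gain(c):
        # Number of still-needed k-mers this candidate would cover.
        return sum(1 for km in contents[c] if need[km] > 0)

    while any(need.values()):
        best = max(candidates, key=gain)  # ties broken lexicographically
        if gain(best) == 0:
            raise ValueError("p is too large for this k and ell")
        for km in contents[best]:
            if need[km] > 0:
                need[km] -= 1
        candidates.remove(best)  # each probe is used at most once
        library.append(best)
    return library


if __name__ == "__main__":
    lib = greedy_cover(k=2, ell=4, p=1)
    print(len(lib), "probes; counting lower bound:", lower_bound(2, 4, 1))
```

For realistic probe lengths the candidate set has 4^ℓ members, so exhaustive enumeration is out of reach, which is presumably one reason the abstract also proposes a heuristic based on random walks in de Bruijn graphs.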
Affiliation(s)
- Yaron Orenstein
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
- Bonnie Berger
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA