1
Deep learning the cis-regulatory code for gene expression in selected model plants. Nat Commun 2024; 15:3488. [PMID: 38664394 PMCID: PMC11045779 DOI: 10.1038/s41467-024-47744-0] [Received: 04/28/2023] [Accepted: 04/09/2024] [Indexed: 04/28/2024]
Abstract
Elucidating the relationship between non-coding regulatory element sequences and gene expression is crucial for understanding gene regulation and genetic variation. We explored this link with the training of interpretable deep learning models predicting gene expression profiles from gene flanking regions of the plant species Arabidopsis thaliana, Solanum lycopersicum, Sorghum bicolor, and Zea mays. With over 80% accuracy, our models enabled predictive feature selection, highlighting e.g. the significant role of UTR regions in determining gene expression levels. The models demonstrated remarkable cross-species performance, effectively identifying both conserved and species-specific regulatory sequence features and their predictive power for gene expression. We illustrated the application of our approach by revealing causal links between genetic variation and gene expression changes across fourteen tomato genomes. Lastly, our models efficiently predicted genotype-specific expression of key functional gene groups, exemplified by underscoring known phenotypic and metabolic differences between Solanum lycopersicum and its wild, drought-resistant relative, Solanum pennellii.
2
Chromosome-level genome assembly of the diploid oat species Avena longiglumis. Sci Data 2024; 11:412. [PMID: 38649380 PMCID: PMC11035610 DOI: 10.1038/s41597-024-03248-6] [Received: 12/28/2023] [Accepted: 04/10/2024] [Indexed: 04/25/2024]
Abstract
Diploid wild oat Avena longiglumis has nutritional and adaptive traits which are valuable for common oat (A. sativa) breeding. The combination of Illumina, Nanopore and Hi-C data allowed us to assemble a high-quality chromosome-level genome of A. longiglumis (ALO), evidenced by contig N50 of 12.68 Mb with 99% BUSCO completeness for the assembly size of 3,960.97 Mb. A total of 40,845 protein-coding genes were annotated. The assembled genome was composed of 87.04% repetitive DNA sequences. Dotplots of the genome assembly (PI657387) with two published ALO genomes were compared to indicate the conservation of gene order and equal expansion of all syntenic blocks among three genome assemblies. Two recent whole-genome duplication events were characterized in genomes of diploid Avena species. These findings provide new knowledge for the genomic features of A. longiglumis, give information about the species diversity, and will accelerate the functional genomics and breeding studies in oat and related cereal crops.
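The contig N50 quoted in this abstract is a length-weighted median: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of that calculation follows; the function name `n50` and the toy contig lengths are illustrative assumptions, not values or code from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2             # half of the assembly size
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length                     # contig that crosses the halfway mark
    raise ValueError("lengths must be non-empty")


# Hypothetical toy assembly: total length 30, so the halfway mark is 15.
toy_contigs = [2, 3, 4, 5, 6, 10]
print(n50(toy_contigs))
```

Because the statistic depends only on the sorted lengths, it can be computed from a FASTA index without loading any sequence.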
3
A new gene finding tool GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes. bioRxiv 2024:2023.01.13.524024. [PMID: 36711453 PMCID: PMC9882169 DOI: 10.1101/2023.01.13.524024] [Indexed: 01/17/2023]
Abstract
Large-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. Here we present an automatic gene finder, GeneMark-ETP, integrating genomic-, transcriptomic- and protein-derived evidence that has been developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data is sufficient for making gene predictions with 'high confidence'. The genes situated in the genomic space between the high confidence genes are predicted in the next stage. The set of high confidence genes serves as an initial training set for the statistical model. Further on, the model parameters are iteratively updated in the rounds of gene prediction and parameter re-estimation. Upon reaching convergence, GeneMark-ETP makes the final predictions and delivers the whole complement of predicted genes. GeneMark-ETP outperformed gene finders using a single type of extrinsic evidence. Comparisons with gene finders utilizing both transcript- and protein-derived extrinsic evidence, MAKER2, and TSEBRA, demonstrated that GeneMark-ETP delivered state-of-the-art gene prediction accuracy with the margin of outperforming existing approaches increasing in its applications to larger and more complex eukaryotic genomes.
4
Modeling alternative translation initiation sites in plants reveals evolutionarily conserved cis-regulatory codes in eukaryotes. Genome Res 2024; 34:272-285. [PMID: 38479836 PMCID: PMC10984385 DOI: 10.1101/gr.278100.123] [Received: 05/15/2023] [Accepted: 02/15/2024] [Indexed: 03/22/2024]
Abstract
mRNA translation relies on identifying translation initiation sites (TISs) in mRNAs. Alternative TISs are prevalent across plant transcriptomes, but the mechanisms for their recognition are unclear. Using ribosome profiling and machine learning, we developed models for predicting alternative TISs in the tomato (Solanum lycopersicum). Distinct feature sets were predictive of AUG and nonAUG TISs in 5' untranslated regions and coding sequences, including a novel CU-rich sequence that promoted plant TIS activity, a translational enhancer found across dicots and monocots, and humans and viruses. Our results elucidate the mechanistic and evolutionary basis of TIS recognition, whereby cis-regulatory RNA signatures affect start site selection. The TIS prediction model provides global estimates of TISs to discover neglected protein-coding genes across plant genomes. The prevalence of cis-regulatory signatures across plant species, humans, and viruses suggests their broad and critical roles in reprogramming the translational landscape.
5
MuDoGeR: Multi-Domain Genome recovery from metagenomes made easy. Mol Ecol Resour 2024; 24:e13904. [PMID: 37994269 DOI: 10.1111/1755-0998.13904] [Received: 09/12/2022] [Revised: 10/18/2023] [Accepted: 11/13/2023] [Indexed: 11/24/2023]
Abstract
Several computational frameworks and workflows that recover genomes from prokaryotes, eukaryotes and viruses from metagenomes exist. Yet, it is difficult for scientists with little bioinformatics experience to evaluate quality, annotate genes, dereplicate, assign taxonomy and calculate relative abundance and coverage of genomes belonging to different domains. MuDoGeR is a user-friendly tool tailored for those familiar with Unix command-line environment that makes it easy to recover genomes of prokaryotes, eukaryotes and viruses from metagenomes, either alone or in combination. We tested MuDoGeR using 24 individual-isolated genomes and 574 metagenomes, demonstrating the applicability for a few samples and high throughput. While MuDoGeR can recover eukaryotic viral sequences, its characterization is predominantly skewed towards bacterial and archaeal viruses, reflecting the field's current state. However, acting as a dynamic wrapper, the MuDoGeR is designed to constantly incorporate updates and integrate new tools, ensuring its ongoing relevance in the rapidly evolving field. MuDoGeR is open-source software available at https://github.com/mdsufz/MuDoGeR. Additionally, MuDoGeR is also available as a Singularity container.
6
High-quality Momordica balsamina genome elucidates its potential use in improving stress resilience and therapeutic properties of bitter gourd. Front Plant Sci 2024; 14:1258042. [PMID: 38333042 PMCID: PMC10851156 DOI: 10.3389/fpls.2023.1258042] [Received: 07/13/2023] [Accepted: 12/29/2023] [Indexed: 02/10/2024]
Abstract
Introduction: Momordica balsamina is the closest wild species that can be crossed with the important fruit vegetable crop Momordica charantia (bitter gourd); it has immense medicinal value and is placed in subclass II of the primary gene pool of bitter gourd. M. balsamina is tolerant to major biotic and abiotic stresses. Genome characterization of M. balsamina as a wild relative of bitter gourd will contribute to the knowledge of the gene pool available for improvement in bitter gourd. There is potential to transfer genes related to biotic resistance and medicinal importance from M. balsamina to M. charantia to produce high-quality, better yielding and stress tolerant bitter gourd genotypes. Methods: The present study provides the first high-quality chromosome-level genome assembly of M. balsamina, with a size of 384.90 Mb and an N50 of 30.96 Mb, using sequence data from 10x Genomics, Nanopore, and Hi-C platforms. Results: A total of 632,098 transposon elements; 215,379 simple sequence repeats; 567,483 transcription factor binding sites; 3,376 noncoding RNA genes; and 41,652 protein-coding genes were identified, and 4,347 disease resistance, 67 heat stress-related, 5 carotenoid-related, 15 salt stress-related, 229 cucurbitacin-related, 19 terpene-related, 37 antioxidant activity, and 6 sex determination-related genes were characterized. Conclusion: Genome sequencing of M. balsamina will facilitate interspecific introgression of desirable traits. This information is cataloged in the form of a web genomic resource available at http://webtom.cabgrid.res.in/mbger/. Our findings from comparative genome analysis will be useful for gaining insights into the patterns and processes associated with genome evolution and for uncovering functional regions of cucurbit genomes.
7
ToxCodAn-Genome: an automated pipeline for toxin-gene annotation in genome assembly of venomous lineages. Gigascience 2024; 13:giad116. [PMID: 38241143 PMCID: PMC10797961 DOI: 10.1093/gigascience/giad116] [Received: 07/14/2023] [Revised: 10/19/2023] [Accepted: 12/18/2023] [Indexed: 01/21/2024]
Abstract
BACKGROUND: The rapid development of sequencing technologies resulted in a wide expansion of genomics studies using venomous lineages. This facilitated research focusing on understanding the evolution of adaptive traits and the search for novel compounds that can be applied in agriculture and medicine. However, the toxin annotation of genomes is a laborious and time-consuming task, and no consensus pipeline is currently available. No computational tool currently exists to address the challenges specific to toxin annotation and to ensure the reproducibility of the process. RESULTS: Here, we present ToxCodAn-Genome, the first software designed to perform automated toxin annotation in genomes of venomous lineages. This pipeline was designed to retrieve the full-length coding sequences of toxins and to allow the detection of novel truncated paralogs and pseudogenes. We tested ToxCodAn-Genome using 12 genomes of venomous lineages and achieved high performance on recovering their current toxin annotations. This tool can be easily customized to allow improvements in the final toxin annotation set and can be expanded to virtually any venomous lineage. ToxCodAn-Genome is fast, allowing it to run on any personal computer, but it can also be executed in multicore mode, taking advantage of large high-performance servers. In addition, we provide a guide to direct future research in the venomics field to ensure a confident toxin annotation in the genome being studied. As a case study, we sequenced and annotated the toxin repertoire of Bothrops alternatus, which may facilitate future evolutionary and biomedical studies using vipers as models. CONCLUSIONS: ToxCodAn-Genome is suitable to perform toxin annotation in the genome of venomous species and may help to improve the reproducibility of further studies. ToxCodAn-Genome and the guide are freely available at https://github.com/pedronachtigall/ToxCodAn-Genome.
8
A chromosome-level genome assembly of the Rhus gall aphid Schlechtendalia chinensis provides insight into the endogenization of Parvovirus-like DNA sequences. BMC Genomics 2024; 25:16. [PMID: 38166596 PMCID: PMC10759679 DOI: 10.1186/s12864-023-09916-y] [Received: 09/07/2023] [Accepted: 12/15/2023] [Indexed: 01/05/2024]
Abstract
The Rhus gall aphid, Schlechtendalia chinensis, feeds on its primary host plant Rhus chinensis to induce galls, which have economic importance in medicines and the food industry. Rhus gall aphids have a unique life cycle and are economically beneficial, but there is a huge gap in genomic information about this group of aphids. Schlechtendalia chinensis induces tannin-rich galls on its host plant and is emerging as a model organism for both commercial applications and applied research in the context of gall production by insects. Here, we generated a high-quality chromosome-level assembly for the S. chinensis genome, enabling the comparison between S. chinensis and non-galling aphids. The final genome assembly is 344.59 Mb with 91.71% of the assembled sequences anchored into 13 chromosomes. We predicted 15,013 genes, of which 14,582 (97.13%) coding genes were annotated, and 99% of the predicted genes were anchored to the 13 chromosomes. This assembly reveals the endogenization of parvovirus-related DNA sequences (PRDs) in the S. chinensis genome, which could play a role in environmental adaptations. We demonstrated the characterization and classification of cytochrome P450s in the genome assembly, which are functionally crucial for sap-feeding insects and have roles in detoxification and insecticide resistance. This genome assembly also revealed the whole genome duplication events in S. chinensis, which can be considered in comparative evolutionary analysis. Our work represents a reference genome for gall-forming aphids that could be used for comparative genomic studies between galling and non-galling aphids and provides the first insight into the endogenization of PRDs in the genome of galling aphids. It also provides novel genetic information for future research on gall formation and insect-plant interactions.
9
Assembly and annotation of the black spruce genome provide insights on spruce phylogeny and evolution of stress response. G3 (Bethesda) 2023; 14:jkad247. [PMID: 37875130 PMCID: PMC10755193 DOI: 10.1093/g3journal/jkad247] [Received: 05/17/2023] [Revised: 05/17/2023] [Accepted: 10/09/2023] [Indexed: 10/26/2023]
Abstract
Black spruce (Picea mariana [Mill.] B.S.P.) is a dominant conifer species in the North American boreal forest that plays important ecological and economic roles. Here, we present the first genome assembly of P. mariana with a reconstructed genome size of 18.3 Gbp and NG50 scaffold length of 36.0 kbp. A total of 66,332 protein-coding sequences were predicted in silico and annotated based on sequence homology. We analyzed the evolutionary relationships between P. mariana and 5 other spruces for which complete nuclear and organelle genome sequences were available. The phylogenetic tree estimated from mitochondrial genome sequences agrees with biogeography; specifically, P. mariana was strongly supported as a sister lineage to P. glauca and 3 other taxa found in western North America, followed by the European Picea abies. We obtained mixed topologies with weaker statistical support in phylogenetic trees estimated from nuclear and chloroplast genome sequences, indicative of ancient reticulate evolution affecting these 2 genomes. Clustering of protein-coding sequences from the 6 Picea taxa and 2 Pinus species resulted in 34,776 orthogroups, 560 of which appeared to be specific to P. mariana. Analysis of these specific orthogroups and dN/dS analysis of positive selection signatures for 497 single-copy orthogroups identified gene functions mostly related to plant development and stress response. The P. mariana genome assembly and annotation provides a valuable resource for forest genetics research and applications in this broadly distributed species, especially in relation to climate adaptation.
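The NG50 reported here differs from N50 in its denominator: the halfway mark is taken from an estimated genome size rather than from the assembly's own total length, which keeps assemblies of different completeness comparable. A minimal sketch under that definition follows; the function name `ng50` and the toy numbers are illustrative assumptions, not the authors' code or data.

```python
def ng50(lengths, genome_size):
    """Return the NG50 of scaffold lengths for a given estimated genome size.

    Returns None when the assembly covers less than half of the genome,
    in which case NG50 is undefined.
    """
    target = genome_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):  # longest first
        running += length
        if running >= target:
            return length
    return None


# Hypothetical toy assembly of total length 30 against a genome estimated at 40.
toy_scaffolds = [10, 6, 5, 4, 3, 2]
print(ng50(toy_scaffolds, genome_size=40))
```

When the assembly is shorter than the genome estimate, NG50 is never larger than N50, so it is the more conservative of the two statistics for fragmented assemblies.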
10
De novo transcriptome assembly of mouse male germ cells reveals novel genes, stage-specific bidirectional promoter activity, and noncoding RNA expression. Genome Res 2023; 33:gr.278060.123. [PMID: 38129075 PMCID: PMC10760527 DOI: 10.1101/gr.278060.123] [Received: 05/02/2023] [Accepted: 09/29/2023] [Indexed: 12/23/2023]
Abstract
In mammals, the adult testis is the tissue with the highest diversity in gene expression. Much of that diversity is attributed to germ cells, primarily meiotic spermatocytes and postmeiotic haploid spermatids. Exploiting a newly developed cell purification method, we profiled the transcriptomes of such postmitotic germ cells of mice. We used a de novo transcriptome assembly approach and identified thousands of novel expressed transcripts characterized by features distinct from those of known genes. Novel loci tend to be short in length, monoexonic, and lowly expressed. Most novel genes have arisen recently in evolutionary time and possess low coding potential. Nonetheless, we identify several novel protein-coding genes harboring open reading frames that encode proteins containing matches to conserved protein domains. Analysis of mass-spectrometry data from adult mouse testes confirms protein production from several of these novel genes. We also examine overlap between transcripts and repetitive elements. We find that although distinct families of repeats are expressed with differing temporal dynamics during spermatogenesis, we do not observe a general mode of regulation wherein repeats drive expression of nonrepetitive sequences in a cell type-specific manner. Finally, we observe many fairly long antisense transcripts originating from canonical gene promoters, pointing to pervasive bidirectional promoter activity during spermatogenesis that is distinct and more frequent compared with somatic cells.
11
Identification and adaptive evolution analysis of glutaredoxin genes in Populus spp. Plant Biol (Stuttg) 2023; 25:1154-1170. [PMID: 37703550 DOI: 10.1111/plb.13580] [Received: 05/16/2023] [Accepted: 08/30/2023] [Indexed: 09/15/2023]
Abstract
Glutaredoxin (GRX) is a class of small redox proteins widely involved in cellular redox homeostasis and the regulation of various cellular processes. The role of GRX gene in the differentiation of Populus spp. is rarely reported. We compared the similarities and differences of GRX genes among four sections of poplar using bioinformatics, corrected the annotations of some GRX genes, and focused on analysing their transcript profiling and adaptive evolution in Populus spp. A total of 219 GRX genes were identified in four sections of poplar, among which annotations for 13 genes were corrected. Differences in GRX genes were found between sect. Turanga, represented by P. euphratica, and other poplar sections. Most notably, P. euphratica had the smallest number of duplication events for GRX genes (n = 9) and no tandem duplications, whereas there were >25 duplication events for all other poplars. Furthermore, we detected 18 pairs of GRX genes under positive selection pressure in various sections of poplar, and identified two groups of GRX genes in the Salicaceae that potentially underwent positive selection. Expression profiling results showed that the PtrGRX34 and its orthologous genes were upregulated under stress treatments. In summary, the GRX gene family underwent expansion during poplar differentiation, and some genes underwent rapid evolution during this process, which may be beneficial for Populus spp. to adapt to environmental changes. This study may provide more insights into the molecular mechanisms of Populus spp. adaptation to environmental changes and the adaptive evolution of GRX genes.
12
Introduction of Plant Transposon Annotation for Beginners. Biology (Basel) 2023; 12:1468. [PMID: 38132293 PMCID: PMC10741241 DOI: 10.3390/biology12121468] [Received: 11/07/2023] [Revised: 11/21/2023] [Accepted: 11/23/2023] [Indexed: 12/23/2023]
Abstract
Transposons are mobile DNA sequences that contribute large fractions of many plant genomes. They provide exclusive resources for tracking gene and genome evolution and for developing molecular tools for basic and applied research. Despite extensive efforts, it is still challenging to accurately annotate transposons, especially for beginners, as transposon prediction requires necessary expertise in both transposon biology and bioinformatics. Moreover, the complexity of plant genomes and the dynamic evolution of transposons also bring difficulties for genome-wide transposon discovery. This review summarizes the three major strategies for transposon detection including repeat-based, structure-based, and homology-based annotation, and introduces the transposon superfamilies identified in plants thus far, and some related bioinformatics resources for detecting plant transposons. Furthermore, it describes transposon classification and explains why the terms 'autonomous' and 'non-autonomous' cannot be used to classify the superfamilies of transposons. Lastly, this review also discusses how to identify misannotated transposons and improve the quality of the transposon database. This review provides helpful information about plant transposons and a beginner's guide on annotating these repetitive sequences.
13
Unravelling the genome of the brackish water malaria vector Anopheles aquasalis. Sci Rep 2023; 13:20472. [PMID: 37993652 PMCID: PMC10665375 DOI: 10.1038/s41598-023-47830-1] [Received: 02/09/2023] [Accepted: 11/19/2023] [Indexed: 11/24/2023]
Abstract
Malaria is a severe public health problem in several developing tropical and subtropical countries. Anopheles aquasalis is the primary coastal malaria vector in Central and South America and the Caribbean Islands, and it has the peculiar feature of living in water with large changes in salinity. Recent research has recognised An. aquasalis as an important model for studying the interactions of murine and human Plasmodium parasites. This study presents the complete genome of An. aquasalis and offers insights into its evolution and physiology. The genome is similar in size and gene content to other Neotropical anophelines, with 162 Mb and 12,446 protein-coding genes. There are 1387 single-copy orthologs at the Diptera level (e.g., An. gambiae, An. darlingi and Drosophila melanogaster). An. aquasalis diverged from An. darlingi, the primary malaria vector in inland South America, nearly 20 million years ago. Proteins related to ion transport and metabolism belong to the most abundant gene families with 660 genes. We identified gene families relevant to osmosis control (e.g., aquaporins, vacuolar-ATPases, Na+/K+-ATPases, and carbonic anhydrases). Evolutionary analysis suggests that all osmotic regulation genes are under strong purifying selection. We also observed low copy number variation in insecticide resistance and immunity-related genes for all known classical pathways. The data provided by this study offer candidate genes for further studies of parasite-vector interactions and for studies on how anophelines of brackish water deal with the high fluctuation in water salinity. We also established data and insights supporting An. aquasalis as an emerging Neotropical malaria vector model for genetic and molecular studies.
14
Evaluation of Different Gene Prediction Tools in Coccidioides immitis. J Fungi (Basel) 2023; 9:1094. [PMID: 37998899 PMCID: PMC10672684 DOI: 10.3390/jof9111094] [Received: 10/02/2023] [Revised: 11/01/2023] [Accepted: 11/07/2023] [Indexed: 11/25/2023]
Abstract
Gene prediction is required to obtain optimal biologically meaningful information from genomic sequences, but automated gene prediction software is imperfect. In this study, we compare the original annotation of the Coccidioides immitis RS genome (the reference strain of C. immitis) to annotations using the Funannotate and Augustus genome prediction pipelines. A total of 25% of the originally predicted genes (denoted CIMG) were not found in either the Funannotate or Augustus predictions. A comparison of Funannotate and Augustus predictions also found overlapping but not identical sets of genes. The predicted genes found only in the original annotation (referred to as CIMG-unique) were less likely to have a meaningful functional annotation and had a lower number of orthologs and homologs in other fungi than all CIMG genes predicted by the original annotation. The CIMG-unique genes were also more likely to be lineage-specific and poorly expressed. In addition, the CIMG-unique genes were found in clusters and tended to be more frequently associated with transposable elements than all CIMG-predicted genes. The CIMG-unique genes were more likely to have experimentally determined transcription start sites that were further away from the originally predicted transcription start sites, and experimentally determined initial transcription was less likely to result in stable CIMG-unique transcripts. A sample of CIMG-unique genes that were relatively well expressed and differentially expressed in mycelia and spherules was inspected in a genome browser, and the structure of only about half of them was found to be supported by RNA-seq data. These data suggest that some of the CIMG-unique genes are not authentic gene predictions. Genes that were predicted only by the Funannotate pipeline were also less likely to have a meaningful functional annotation, tended to be shorter, and were less well expressed than all the genes predicted by Funannotate. C. immitis genes predicted by more than one annotation are more likely to have predicted functions, many orthologs and homologs, and be well expressed. Lineage-specific genes are relatively uncommon in this group. These data emphasize the importance and limitations of gene prediction software and suggest that improvements to the annotation of the C. immitis genome should be considered.
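The comparison described in this abstract comes down to set operations over the genes each annotation predicts. The sketch below assumes, hypothetically, that predictions from the three annotations have already been reconciled to shared gene identifiers; in practice that step needs coordinate-overlap matching, which is not shown. The function name, the dictionary keys, and the toy identifiers are all illustrative and do not come from the paper.

```python
def compare_annotations(original, funannotate, augustus):
    """Summarise overlap between three sets of gene identifiers."""
    original, funannotate, augustus = set(original), set(funannotate), set(augustus)
    # Genes present only in the original annotation (the "unique" class).
    original_only = original - (funannotate | augustus)
    # Genes that every annotation agrees on.
    all_three = original & funannotate & augustus
    return {
        "original_only": original_only,
        "all_three": all_three,
        "fraction_original_only": len(original_only) / len(original),
    }


# Hypothetical toy identifiers, chosen only to exercise the function.
summary = compare_annotations(
    original={"g1", "g2", "g3", "g4"},
    funannotate={"g1", "g2", "g5"},
    augustus={"g1", "g3"},
)
print(summary["original_only"], summary["fraction_original_only"])
```

Genes supported by more than one annotation can be obtained the same way by taking unions of the pairwise intersections.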
15
A genomic panel for studying C3-C4 intermediate photosynthesis in the Brassiceae tribe. Plant Cell Environ 2023; 46:3611-3627. [PMID: 37431820 DOI: 10.1111/pce.14662] [Received: 03/02/2023] [Revised: 05/18/2023] [Accepted: 06/23/2023] [Indexed: 07/12/2023]
Abstract
Research on C4 and C3-C4 photosynthesis has attracted significant attention because the understanding of the genetic underpinnings of these traits will support the introduction of its characteristics into commercially relevant crop species. We used a panel of 19 taxa of 18 Brassiceae species with different photosynthesis characteristics (C3 and C3-C4) with the following objectives: (i) create draft genome assemblies and annotations, (ii) quantify orthology levels using synteny maps between all pairs of taxa, (iii) describe the phylogenetic relatedness across all the species, and (iv) track the evolution of C3-C4 intermediate photosynthesis in the Brassiceae tribe. Our results indicate that the draft de novo genome assemblies are of high quality and cover at least 90% of the gene space. Therewith we more than doubled the sampling depth of genomes of the Brassiceae tribe that comprises commercially important as well as biologically interesting species. The gene annotation generated high-quality gene models, and for most genes extensive upstream sequences are available for all taxa, yielding potential to explore variants in regulatory sequences. The genome-based phylogenetic tree of the Brassiceae contained two main clades and indicated that the C3-C4 intermediate photosynthesis has evolved five times independently. Furthermore, our study provides the first genomic support of the hypothesis that Diplotaxis muralis is a natural hybrid of D. tenuifolia and D. viminea. Altogether, the de novo genome assemblies and the annotations reported in this study are a valuable resource for research on the evolution of C3-C4 intermediate photosynthesis.
16
Abstract
Pangolins form a group of scaly mammals that are trafficked at record numbers for their meat and purported medicinal properties. Despite their conservation concern, knowledge of their evolution is limited by a paucity of genomic data. We aim to produce exhaustive genomic resources that include 3,238 orthologous genes and whole-genome polymorphisms to assess the evolution of all eight extant pangolin species. Robust orthologous gene-based phylogenies recovered the monophyly of the three genera and highlighted the existence of an undescribed species closely related to Southeast Asian pangolins. Signatures of middle Miocene admixture between an extinct, possibly European, lineage and the ancestor of Southeast Asian pangolins, provide new insights into the early evolutionary history of the group. Demographic trajectories and genome-wide heterozygosity estimates revealed contrasts between continental versus island populations and species lineages, suggesting that conservation planning should consider intraspecific patterns. With the expected loss of genomic diversity from recent, extensive trafficking not yet realized in pangolins, we recommend that populations be genetically surveyed to anticipate any deleterious impact of the illegal trade. Finally, we produce a complete set of genomic resources that will be integral for future conservation management and forensic endeavors for pangolins, including tracing their illegal trade. These comprise the completion of whole-genomes for pangolins through the hybrid assembly of the first reference genome for the giant pangolin (Smutsia gigantea) and new draft genomes (∼43x-77x) for four additional species, as well as a database of orthologous genes with over 3.4 million polymorphic sites.
|
17
|
Translation initiation at AUG and non-AUG triplets in plants. PLANT SCIENCE : AN INTERNATIONAL JOURNAL OF EXPERIMENTAL PLANT BIOLOGY 2023; 335:111822. [PMID: 37574140 DOI: 10.1016/j.plantsci.2023.111822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/15/2022] [Revised: 07/22/2023] [Accepted: 08/07/2023] [Indexed: 08/15/2023]
Abstract
In plants and other eukaryotes, precise selection of translation initiation site (TIS) on mRNAs shapes the proteome in response to cellular events or environmental cues. The canonical translation of mRNAs initiates at a 5' proximal AUG codon in a favorable context. However, the coding and non-coding regions of plant genomes contain numerous unannotated alternative AUG and non-AUG TISs. Determining how and why these unexpected and prevalent TISs are activated in plants has emerged as an exciting research area. In this review, we focus on the selection of plant TISs and highlight studies that revealed previously unannotated TISs used in vivo via comparative genomics and genome-wide profiling of ribosome positioning and protein N-terminal ends. The biological signatures of non-AUG TIS-initiated open reading frames (ORFs) in plants are also discussed. We describe what is understood about cis-regulatory RNA elements and trans-acting eukaryotic initiation factors (eIFs) in the site selection for translation initiation by featuring the findings in plants along with supporting findings in non-plant species. The prevalent, unannotated TISs provide a hidden reservoir of ORFs that likely help reshape plant proteomes in response to developmental or environmental cues. These findings underscore the importance of understanding the mechanistic basis of TIS selection to functionally annotate plant genomes, especially for crops with large genomes.
|
18
|
A practical approach to genome assembly and annotation of Basidiomycota using the example of Armillaria. Biotechniques 2023; 75:115-128. [PMID: 37681497 DOI: 10.2144/btn-2023-0023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/09/2023] Open
Abstract
Technological advancements in genome sequencing, assembly and annotation platforms and algorithms that resulted in several genomic studies have created an opportunity to further our understanding of the biology of phytopathogens, including Armillaria species. Most Armillaria species are facultative necrotrophs that cause root- and stem-rot, usually on woody plants, significantly impacting agriculture and forestry worldwide. Genome sequencing, assembly and annotation in terms of samples used and methods applied in Armillaria genome projects are evaluated in this review. Infographic guidelines and a database of resources to facilitate future Armillaria genome projects were developed. Knowledge gained from genomic studies of Armillaria species is summarized and prospects for further research are provided. This guide can be applied to other diploid and dikaryotic fungal genomics.
|
19
|
Galba: genome annotation with miniprot and AUGUSTUS. BMC Bioinformatics 2023; 24:327. [PMID: 37653395 PMCID: PMC10472564 DOI: 10.1186/s12859-023-05449-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2023] [Accepted: 08/21/2023] [Indexed: 09/02/2023] Open
Abstract
BACKGROUND The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. RESULTS Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments. CONCLUSIONS Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.
|
20
|
Pangenome-based trajectories of intracellular gene transfers in Poaceae unveil high cumulation in Triticeae. PLANT PHYSIOLOGY 2023; 193:578-594. [PMID: 37249052 PMCID: PMC10469385 DOI: 10.1093/plphys/kiad319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/14/2023] [Accepted: 05/04/2023] [Indexed: 05/31/2023]
Abstract
Intracellular gene transfers (IGTs) between the nucleus and organelles, including plastids and mitochondria, constantly reshape the nuclear genome during evolution. Despite the substantial contribution of IGTs to genome variation, the dynamic trajectories of IGTs at the pangenomic level remain elusive. Here, we developed an approach, IGTminer, that maps the evolutionary trajectories of IGTs using collinearity and gene reannotation across multiple genome assemblies. We applied IGTminer to create a nuclear organellar gene (NOG) map across 67 genomes covering 15 Poaceae species, including important crops. The resulting NOGs were verified by experiments and sequencing data sets. Our analysis revealed that most NOGs were recently transferred and lineage specific and that Triticeae species tended to have more NOGs than other Poaceae species. Wheat (Triticum aestivum) had a higher retention rate of NOGs than maize (Zea mays) and rice (Oryza sativa), and the retained NOGs were likely involved in photosynthesis and translation pathways. Large numbers of NOG clusters were aggregated in hexaploid wheat during 2 rounds of polyploidization, contributing to the genetic diversity among modern wheat accessions. We implemented an interactive web server to facilitate the exploration of NOGs in Poaceae. In summary, this study provides resources and insights into the roles of IGTs in shaping interspecies and intraspecies genome variation and driving plant genome evolution.
|
21
|
Inferring and comparing metabolism across heterogeneous sets of annotated genomes using AuCoMe. Genome Res 2023; 33:gr.277056.122. [PMID: 37468308 PMCID: PMC10629481 DOI: 10.1101/gr.277056.122] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2022] [Accepted: 05/23/2023] [Indexed: 07/21/2023]
Abstract
Comparative analysis of genome-scale metabolic networks (GSMNs) may yield important information on the biology, evolution, and adaptation of species. However, it is impeded by the high heterogeneity of the quality and completeness of structural and functional genome annotations, which may bias the results of such comparisons. To address this issue, we developed AuCoMe, a pipeline to automatically reconstruct homogeneous GSMNs from a heterogeneous set of annotated genomes without discarding available manual annotations. We tested AuCoMe with three data sets, one bacterial, one fungal, and one algal, and showed that it successfully reduces technical biases while capturing the metabolic specificities of each organism. Our results also point out shared and divergent metabolic traits among evolutionarily distant algae, underlining the potential of AuCoMe to accelerate the broad exploration of metabolic evolution across the tree of life.
|
22
|
Genome sequencing and application of Taiwanese macaque Macaca cyclopis. Sci Rep 2023; 13:11545. [PMID: 37460589 DOI: 10.1038/s41598-023-38402-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2022] [Accepted: 07/07/2023] [Indexed: 07/20/2023] Open
Abstract
The Formosan macaque (Macaca cyclopis) is the only non-human primate on Taiwan Island. We performed de novo hybrid assembly for M. cyclopis using Illumina paired-end short reads, mate-pair reads and Nanopore long reads and obtained 5065 contigs with an N50 of 2.66 megabases. M. cyclopis contigs ≥ 10 kb were assigned to chromosomes using the Indian rhesus macaque (Macaca mulatta mulatta) genome assembly Mmul_10 as reference, resulting in a draft M. cyclopis genome of 2,846,042,475 bases distributed across 21 chromosomes. The draft genome contains 23,462 transcriptional origins (genes), capable of expressing 716,231 exons in 59,484 transcripts. A genome-based phylogenetic study using the assembled M. cyclopis genome together with genomes of four other macaque species, human, orangutan and chimpanzee showed results similar to those previously reported. However, the M. cyclopis species was found to diverge from Chinese M. mulatta lasiota about 1.8 million years ago. Fossil gene analysis detected the presence of gap and pol endogenous viral elements of simian retrovirus in all macaques tested, including M. fascicularis, M. m. mulatta and M. cyclopis. However, in M. cyclopis these elements were ~2 times fewer in number and more uniform in chromosomal location. The constraint on foreign genome disturbance, presumably due to geographical isolation, should simplify genomics-related investigations, making M. cyclopis an ideal primate species for medical research.
|
23
|
Welcome to the big leaves: Best practices for improving genome annotation in non-model plant genomes. APPLICATIONS IN PLANT SCIENCES 2023; 11:e11533. [PMID: 37601314 PMCID: PMC10439824 DOI: 10.1002/aps3.11533] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/04/2022] [Revised: 02/04/2023] [Accepted: 02/10/2023] [Indexed: 08/22/2023]
Abstract
Premise Robust standards to evaluate quality and completeness are lacking in eukaryotic structural genome annotation, as genome annotation software is developed using model organisms and typically lacks benchmarking to comprehensively evaluate the quality and accuracy of the final predictions. The annotation of plant genomes is particularly challenging due to their large sizes, abundant transposable elements, and variable ploidies. This study investigates the impact of genome quality, complexity, sequence read input, and method on protein-coding gene predictions. Methods The impact of repeat masking, long-read and short-read inputs, and de novo and genome-guided protein evidence was examined in the context of the popular BRAKER and MAKER workflows for five plant genomes. The annotations were benchmarked for structural traits and sequence similarity. Results Benchmarks that reflect gene structures, reciprocal similarity search alignments, and mono-exonic/multi-exonic gene counts provide a more complete view of annotation accuracy. Transcripts derived from RNA-read alignments alone are not sufficient for genome annotation. Gene prediction workflows that combine evidence-based and ab initio approaches are recommended, and a combination of short and long reads can improve genome annotation. Adding protein evidence from de novo assemblies, genome-guided transcriptome assemblies, or full-length proteins from OrthoDB generates more putative false positives as implemented in the current workflows. Post-processing with functional and structural filters is highly recommended. Discussion While the annotation of non-model plant genomes remains complex, this study provides recommendations for inputs and methodological approaches. We discuss a set of best practices to generate an optimal plant genome annotation and present a more robust set of metrics to evaluate the resulting predictions.
|
24
|
Pest status, molecular evolution, and epigenetic factors derived from the genome assembly of Frankliniella fusca, a thysanopteran phytovirus vector. BMC Genomics 2023; 24:343. [PMID: 37344773 DOI: 10.1186/s12864-023-09375-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2022] [Accepted: 05/13/2023] [Indexed: 06/23/2023] Open
Abstract
BACKGROUND The tobacco thrips (Frankliniella fusca Hinds; family Thripidae; order Thysanoptera) is an important pest that can transmit viruses such as the tomato spotted wilt orthotospovirus to numerous economically important agricultural row crops and vegetables. The structural and functional genomics within the order Thysanoptera has only begun to be explored. Within the >7000 known thysanopteran species, the melon thrips (Thrips palmi Karny) and the western flower thrips (Frankliniella occidentalis Pergande) are the only two thysanopteran species with assembled genomes. RESULTS A genome of F. fusca was assembled by long-read sequencing of DNA from an inbred line. The final assembly size was 370 Mb with a single copy ortholog completeness of ~99% with respect to Insecta. The annotated genome of F. fusca was compared with the genome of its congener, F. occidentalis. Results revealed many instances of lineage-specific differences in gene content. Analyses of sequence divergence between the two Frankliniella species' genomes revealed substitution patterns consistent with positive selection in ~5% of the protein-coding genes with 1:1 orthologs. Further, gene content related to its pest status, such as xenobiotic detoxification and response to an ambisense-tripartite RNA virus (orthotospovirus) infection was compared with F. occidentalis. Several F. fusca genes related to virus infection possessed signatures of positive selection. Estimation of CpG depletion, a mutational consequence of DNA methylation, revealed that F. fusca genes that were downregulated and alternatively spliced in response to virus infection were preferentially targeted by DNA methylation. As in many other insects, DNA methylation was enriched in exons in Frankliniella, but gene copies with homology to DNA methyltransferase 3 were numerous and fragmented. This phenomenon seems to be relatively unique to thrips among other insect groups. CONCLUSIONS The F. fusca genome assembly provides an important resource for comparative genomic analyses of thysanopterans. This genomic foundation allows for insights into molecular evolution, gene regulation, and loci important to agricultural pest status.
|
25
|
CALANGO: A phylogeny-aware comparative genomics tool for discovering quantitative genotype-phenotype associations across species. PATTERNS (NEW YORK, N.Y.) 2023; 4:100728. [PMID: 37409050 PMCID: PMC10318336 DOI: 10.1016/j.patter.2023.100728] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/20/2022] [Revised: 12/08/2022] [Accepted: 03/15/2023] [Indexed: 07/07/2023]
Abstract
Living species vary significantly in phenotype and genomic content. Sophisticated statistical methods linking genes with phenotypes within a species have led to breakthroughs in complex genetic diseases and genetic breeding. Despite the abundance of genomic and phenotypic data available for thousands of species, finding genotype-phenotype associations across species is challenging due to the non-independence of species data resulting from common ancestry. To address this, we present CALANGO (comparative analysis with annotation-based genomic components), a phylogeny-aware comparative genomics tool to find homologous regions and biological roles associated with quantitative phenotypes across species. In two case studies, CALANGO identified both known and previously unidentified genotype-phenotype associations. The first study revealed unknown aspects of the ecological interaction between Escherichia coli, its integrated bacteriophages, and the pathogenicity phenotype. The second identified an association between maximum height in angiosperms and the expansion of a reproductive mechanism that prevents inbreeding and increases genetic diversity, with implications for conservation biology and agriculture.
|
26
|
GALBA: Genome Annotation with Miniprot and AUGUSTUS. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.04.10.536199. [PMID: 37090650 PMCID: PMC10120627 DOI: 10.1101/2023.04.10.536199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/25/2023]
Abstract
The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a previously unannotated land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments. Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.
|
27
|
Contribution of Retrotransposons to the Pathogenesis of Type 1 Diabetes and Challenges in Analysis Methods. Int J Mol Sci 2023; 24:ijms24043104. [PMID: 36834511 PMCID: PMC9966460 DOI: 10.3390/ijms24043104] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2022] [Revised: 01/30/2023] [Accepted: 02/02/2023] [Indexed: 02/09/2023] Open
Abstract
Type 1 diabetes (T1D) is one of the most common chronic diseases of the endocrine system, associated with several life-threatening comorbidities. While the etiopathogenesis of T1D remains elusive, a combination of genetic susceptibility and environmental factors, such as microbial infections, is thought to be involved in the development of the disease. The prime model for studying the genetic component of T1D predisposition encompasses polymorphisms within the HLA (human leukocyte antigen) region responsible for the specificity of antigen presentation to lymphocytes. Apart from polymorphisms, genomic reorganization caused by repeat elements and endogenous viral elements (EVEs) might be involved in T1D predisposition. Such elements include human endogenous retroviruses (HERVs) and non-long terminal repeat (non-LTR) retrotransposons, including long and short interspersed nuclear elements (LINEs and SINEs). In line with their parasitic origin and selfish behaviour, retrotransposon-imposed gene regulation is a major source of genetic variation and instability in the human genome, and may represent the missing link between genetic susceptibility and environmental factors long thought to contribute to T1D onset. Autoreactive immune cell subtypes with differentially expressed retrotransposons can be identified with single-cell transcriptomics, and personalized assembled genomes can be constructed, which can then serve as a reference for predicting retrotransposon integration/restriction sites. Here we review what is known to date about retrotransposons, discuss the involvement of viruses and retrotransposons in T1D predisposition, and finally consider challenges in retrotransposon analysis methods.
|
28
|
Addressing the pervasive scarcity of structural annotation in eukaryotic algae. Sci Rep 2023; 13:1687. [PMID: 36717613 PMCID: PMC9886943 DOI: 10.1038/s41598-023-27881-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2022] [Accepted: 01/09/2023] [Indexed: 02/01/2023] Open
Abstract
Despite a continuous increase in algal genome sequencing, structural annotations of most algal genome assemblies remain unavailable. This pervasive scarcity of genome annotation has restricted rigorous investigation of these genomic resources and may have precipitated misleading biological interpretations. However, the annotation process for eukaryotic algal species is often challenging as genomic resources and transcriptomic evidence are not always available. To address this challenge, we benchmark the cutting-edge gene prediction methods that can be generalized for a broad range of non-model eukaryotes. Using the most accurate methods selected based on high-quality algal genomes, we predict structural annotations for 135 unannotated algal genomes. Using previously available genomic data pooled together with new data obtained in this study, we identified the core orthologous genes and the multi-gene phylogeny of eukaryotic algae, including previously unexplored algal species. This study not only provides a benchmark for the use of structural annotation methods on a variety of non-model eukaryotes, but also compensates for missing data in the current spectrum of algal genomic resources. These results bring us one step closer to the full potential of eukaryotic algal genomics.
|
29
|
Assembly and annotation of the Gossypium barbadense L. 'Pima-S6' genome raise questions about the chromosome structure and gene content of Gossypium barbadense genomes. BMC Genomics 2023; 24:11. [PMID: 36627552 PMCID: PMC9830710 DOI: 10.1186/s12864-022-09102-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2022] [Accepted: 12/28/2022] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND Gossypium barbadense L. Pima cotton is known for its resistance to Fusarium wilt and for producing fibers of superior quality highly prized in the textile market. We report a high-quality genome assembly and annotation of Pima-S6 cotton and its comparison at the chromosome and protein level to ten other published Gossypium genome assemblies. RESULTS Synteny and orthogroup analyses revealed important differences in chromosome structure and annotated protein content between our Pima-S6 and other publicly available G. barbadense assemblies, and across Gossypium assemblies in general. Detailed synteny analyses revealed chromosomal rearrangements between Pima-S6 and other Pima genomes on several chromosomes, with three major inversions in chromosomes A09, A13 and D05, raising questions about the true chromosome structure of Gossypium barbadense genomes. CONCLUSION Analyses of the re-assembled and re-annotated genome of the close relative G. barbadense Pima 3-79 using our Pima-S6 assembly suggest that contig placement of some recent G. barbadense assemblies might have been unduly influenced by the use of the G. hirsutum TM-1 genome as the anchoring reference. The Pima-S6 reference genome provides a valuable genomic resource, offers new insights into genomic structure, and can serve as a G. barbadense genome reference for future assemblies and further support FOV4-related studies and breeding efforts.
|
30
|
Benchmark study for evaluating the quality of reference genomes and gene annotations in 114 species. Front Vet Sci 2023; 10:1128570. [PMID: 36896291 PMCID: PMC9988948 DOI: 10.3389/fvets.2023.1128570] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2022] [Accepted: 02/02/2023] [Indexed: 02/23/2023] Open
Abstract
Introduction Reference genomes and gene annotations are key materials that can determine the limits of the molecular biology research of a species; however, systematic research on their quality assessment remains insufficient. Methods We collected reference assemblies, gene annotations, and 3,420 RNA-sequencing (RNA-seq) data from 114 species and selected effective indicators to simultaneously evaluate the reference genome quality of various species, including statistics that can be obtained empirically during the mapping process of short reads. Furthermore, we introduced and applied transcript diversity and quantification success rates, which allow the quality of gene annotations to be compared across species. Finally, we proposed a next-generation sequencing (NGS) applicability index by integrating a total of 10 effective indicators that can evaluate the genome and gene annotation of a specific species. Results and discussion Based on these effective evaluation indicators, we successfully evaluated and demonstrated the relative accessibility of NGS applications in all species, which will directly contribute to determining the technological boundaries in each species. Simultaneously, we expect that it will be a key indicator to examine the direction of future development through relative quality evaluation of genomes and gene annotations in each species, including countless organisms whose genomes and gene annotations will be constructed in the future.
|
31
|
Phylogenomics provides insights into the evolution of cactophily and host plant shifts in Drosophila. Mol Phylogenet Evol 2023; 178:107653. [PMID: 36404461 DOI: 10.1016/j.ympev.2022.107653] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2022] [Revised: 09/30/2022] [Accepted: 10/25/2022] [Indexed: 11/06/2022]
Abstract
Cactophilic species of the Drosophila buzzatii cluster (repleta group) comprise an excellent model group to investigate genomic changes underlying adaptation to extreme climate conditions and host plants. In particular, these species form a tractable system to study the transition from chemically simpler breeding sites (like prickly pears of the genus Opuntia) to chemically more complex hosts (columnar cacti). Here, we report four highly contiguous genome assemblies of three species of the buzzatii cluster. Based on this genomic data and inferred phylogenetic relationships, we identified candidate taxonomically restricted genes (TRGs) likely involved in the evolution of cactophily and cactus host specialization. Functional enrichment analyses of TRGs within the buzzatii cluster identified genes involved in detoxification, water preservation, immune system response, anatomical structure development, and morphogenesis. In contrast, processes that regulate responses to stress, as well as the metabolism of nitrogen compounds, transport, and secretion were found in the set of species that are columnar cacti dwellers. These findings are in line with the hypothesis that those genomic changes brought about key mechanisms underlying the adaptation of the buzzatii cluster species to arid regions in South America.
|
32
|
Genome assembly of 3 Amazonian Morpho butterfly species reveals Z-chromosome rearrangements between closely related species living in sympatry. Gigascience 2022; 12:giad033. [PMID: 37216769 PMCID: PMC10202424 DOI: 10.1093/gigascience/giad033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2022] [Revised: 02/13/2023] [Accepted: 05/12/2023] [Indexed: 05/24/2023] Open
Abstract
The genomic processes enabling speciation and species coexistence in sympatry are still largely unknown. Here we describe the whole-genome sequencing and assembly of 3 closely related species from the butterfly genus Morpho: Morpho achilles (Linnaeus, 1758), Morpho helenor (Cramer, 1776), and Morpho deidamia (Hübner, 1819). These large blue butterflies are emblematic species of the Amazonian rainforest. They live in sympatry in a wide range of their geographical distribution and display parallel diversification of dorsal wing color pattern, suggesting local mimicry. By sequencing, assembling, and annotating their genomes, we aim to uncover prezygotic barriers preventing gene flow between these sympatric species. We found a genome size of 480 Mb for the 3 species and a chromosomal number ranging from 2n = 54 for M. deidamia to 2n = 56 for M. achilles and M. helenor. We also detected inversions on the sex chromosome Z that were differentially fixed between species, suggesting that chromosomal rearrangements may contribute to their reproductive isolation. The annotation of their genomes allowed us to recover in each species at least 12,000 protein-coding genes and to discover duplications of genes potentially involved in prezygotic isolation like genes controlling color discrimination (L-opsin). Altogether, the assembly and the annotation of these 3 new reference genomes open new research avenues into the genomic architecture of speciation and reinforcement in sympatry, establishing Morpho butterflies as a new eco-evolutionary model.
|
33
|
Comparative Genomics for Evolutionary Cell Biology Using AMOEBAE: Understanding the Golgi and Beyond. Methods Mol Biol 2022; 2557:431-452. [PMID: 36512230 DOI: 10.1007/978-1-0716-2639-9_26] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Abstract
Taking an evolutionary approach to cell biology can yield important new information about how the cell works and how it evolved to do so. This is true of the Golgi apparatus, as it is of all systems within the cell. Comparative genomics is one of the crucial first steps to this line of research, but comes with technical challenges that must be overcome for rigor and robustness. We here introduce AMOEBAE, a workflow for mid-range scale comparative genomic analyses. It allows for customization of parameters, queries, and taxonomic sampling of genomic and transcriptomics data. This protocol article covers the rationale for an evolutionary approach to cell biological study (i.e., when would AMOEBAE be useful), how to use AMOEBAE, and discussion of limitations. It also provides an example dataset, which demonstrates that the Golgi protein AP4 Epsilon is present as the sole retained subunit of the AP4 complex in basidiomycete fungi. AMOEBAE can facilitate comparative genomic studies by balancing reproducibility and speed with user-input and interpretation. It is hoped that AMOEBAE or similar tools will encourage cell biologists to incorporate an evolutionary context into their research.
|
34
|
The Sum of Two Halves May Be Different from the Whole-Effects of Splitting Sequencing Samples Across Lanes. Genes (Basel) 2022; 13:genes13122265. [PMID: 36553532 PMCID: PMC9777937 DOI: 10.3390/genes13122265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2022] [Revised: 11/23/2022] [Accepted: 11/25/2022] [Indexed: 12/03/2022] Open
Abstract
The advances in high-throughput sequencing (HTS) have enabled the characterisation of biological processes at an unprecedented level of detail; most hypotheses in molecular biology rely on analyses of HTS data. However, achieving increased robustness and reproducibility of results remains a main challenge. Although variability in results may be introduced at various stages, e.g., alignment, summarisation or detection of differential expression, one source of variability was systematically omitted: the sequencing design, which propagates through analyses and may introduce an additional layer of technical variation. We illustrate qualitative and quantitative differences arising from splitting samples across lanes on bulk and single-cell sequencing. For bulk mRNAseq data, we focus on differential expression and enrichment analyses; for bulk ChIPseq data, we investigate the effect on peak calling and the peaks' properties. At the single-cell level, we concentrate on identifying cell subpopulations. We rely on markers used for assigning cell identities; both smartSeq and 10× data are presented. The observed reduction in the number of unique sequenced fragments limits the level of detail on which the different prediction approaches depend. Furthermore, the sequencing stochasticity adds in a weighting bias corroborated with variable sequencing depths and (yet unexplained) sequencing bias. Subsequently, we observe an overall reduction in sequencing complexity and a distortion in the biological signal across technologies, experimental contexts, organisms and tissues.
|
35
|
Genome sequence and silkomics of the spindle ermine moth, Yponomeuta cagnagella, representing the early diverging lineage of the ditrysian Lepidoptera. Commun Biol 2022; 5:1281. [PMID: 36418465 PMCID: PMC9684489 DOI: 10.1038/s42003-022-04240-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2022] [Accepted: 11/09/2022] [Indexed: 11/24/2022] Open
Abstract
Many lepidopteran species produce silk, cocoons, feeding tubes, or nests to protect caterpillars and pupae from predators and parasites. Yet the number of lepidopteran species whose silk composition has been studied in detail is very small, because the genes encoding the major structural silk proteins tend to be large and repetitive, making their assembly and sequence analysis difficult. Here we have analyzed the silk of Yponomeuta cagnagella, which represents one of the early diverging lineages of the ditrysian Lepidoptera, thus improving the coverage of the order. To obtain a comprehensive list of the Y. cagnagella silk genes, we sequenced and assembled a draft genome using Oxford Nanopore and Illumina technologies. We used a silk-gland transcriptome and a silk proteome to identify major silk components and verified the tissue specificity of expression of individual genes. A detailed annotation of the major genes and their putative products, including their complete sequences and exon-intron structures, is provided. The morphology of silk glands and fibers is also shown. This study fills an important gap in our growing understanding of the structure, evolution, and function of silk genes and provides genomic resources for future studies of the chemical ecology of Yponomeuta species.
|
36
|
AnnotaPipeline: An integrated tool to annotate eukaryotic proteins using multi-omics data. Front Genet 2022; 13:1020100. [DOI: 10.3389/fgene.2022.1020100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Accepted: 11/11/2022] [Indexed: 11/23/2022] Open
Abstract
Assignment of gene function has been a crucial, laborious, and time-consuming step in genomics. Due to a variety of sequencing platforms that generates increasing amounts of data, manual annotation is no longer feasible. Thus, the need for an integrated, automated pipeline allowing the use of experimental data towards validation of in silico prediction of gene function is of utmost relevance. Here, we present a computational workflow named AnnotaPipeline that integrates distinct software and data types on a proteogenomic approach to annotate and validate predicted features in genomic sequences. Based on FASTA (i) nucleotide or (ii) protein sequences or (iii) structural annotation files (GFF3), users can input FASTQ RNA-seq data, MS/MS data from mzXML or similar formats, as the pipeline uses both transcriptomic and proteomic information to corroborate annotations and validate gene prediction, providing transcription and expression evidence for functional annotation. Reannotation of the available Arabidopsis thaliana, Caenorhabditis elegans, Candida albicans, Trypanosoma cruzi, and Trypanosoma rangeli genomes was performed using the AnnotaPipeline, resulting in a higher proportion of annotated proteins and a reduced proportion of hypothetical proteins when compared to the annotations publicly available for these organisms. AnnotaPipeline is a Unix-based pipeline developed using Python and is available at: https://github.com/bioinformatics-ufsc/AnnotaPipeline.
|
37
|
The economics and policy of genome editing in crop improvement. THE PLANT GENOME 2022:e20248. [PMID: 36321718 DOI: 10.1002/tpg2.20248] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/18/2022] [Accepted: 05/26/2022] [Indexed: 06/16/2023]
Abstract
In this review article we analyze the economics of genome editing and its potential long-term effect on crop improvement and agriculture. We describe the emergence of genome editing as a novel platform for crop improvement, distinct from the existing platforms of plant breeding and genetic engineering. We review key technical characteristics of genome editing and describe how it enables faster trait development, lower research and development costs, and the development of novel traits not possible through previous crop improvement methods. Given these fundamental technical and economic advantages, we describe how genome editing can greatly increase the productivity and broaden the scope of crop improvement with potential outsized economic effects. We further discuss how the global regulatory policy environment, which is still emerging, can shape the ultimate path of genome editing innovation, its effect on crop improvement, and its overall socioeconomic benefits to society.
|
38
|
Propagation, detection and correction of errors using the sequence database network. Brief Bioinform 2022; 23:6764545. [PMID: 36266246 PMCID: PMC9677457 DOI: 10.1093/bib/bbac416] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/10/2022] [Revised: 07/31/2022] [Accepted: 08/28/2022] [Indexed: 12/14/2022] Open
Abstract
Nucleotide and protein sequences stored in public databases are the cornerstone of many bioinformatics analyses. The records containing these sequences are prone to a wide range of errors, including incorrect functional annotation, sequence contamination and taxonomic misclassification. One source of information that can help to detect errors is the strong interdependency between records. Novel sequences in one database draw their annotations from existing records, may generate new records in multiple other locations and will have varying degrees of similarity with existing records across a range of attributes. A network perspective of these relationships between sequence records, within and across databases, offers new opportunities to detect, or even correct, erroneous entries and more broadly to make inferences about record quality. Here, we describe this novel perspective of sequence database records as a rich network, which we call the sequence database network, and illustrate the opportunities this perspective offers for quantification of database quality and detection of spurious entries. We provide an overview of the relevant databases and describe how the interdependencies between sequence records across these databases can be exploited by network analyses. We review the process of sequence annotation and provide a classification of sources of error, highlighting propagation as a major source. We illustrate the value of a network perspective through three case studies that use network analysis to detect errors, and explore the quality and quantity of critical relationships that would inform such network analyses. This systematic description of a network perspective of sequence database records provides a novel direction to combat the proliferation of errors within these critical bioinformatics resources.
|
39
|
Prediction of transcript isoforms in 19 chicken tissues by Oxford Nanopore long-read sequencing. Front Genet 2022; 13:997460. [PMID: 36246588 PMCID: PMC9561881 DOI: 10.3389/fgene.2022.997460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2022] [Accepted: 08/30/2022] [Indexed: 11/22/2022] Open
Abstract
To identify and annotate transcript isoforms in the chicken genome, we generated Nanopore long-read sequencing data from 68 samples that encompassed 19 diverse tissues collected from experimental adult male and female White Leghorn chickens. More than 23.8 million reads with mean read length of 790 bases and average quality of 18.2 were generated. The annotation and subsequent filtering resulted in the identification of 55,382 transcripts at 40,547 loci with mean length of 1,700 bases. We predicted 30,967 coding transcripts at 19,461 loci, and 16,495 lncRNA transcripts at 15,512 loci. Compared to existing reference annotations, we found ∼52% of annotated transcripts could be partially or fully matched while ∼47% were novel. Seventy percent of novel transcripts were potentially transcribed from lncRNA loci. Based on our annotation, we quantified transcript expression across tissues and found two brain tissues (i.e., cerebellum and cortex) expressed the highest number of transcripts and loci. Furthermore, ∼22% of the transcripts displayed tissue specificity with the reproductive tissues (i.e., testis and ovary) exhibiting the most tissue-specific transcripts. Despite our wide sampling, ∼20% of Ensembl reference loci were not detected. This suggests that deeper sequencing and additional samples that include different breeds, cell types, developmental stages, and physiological conditions, are needed to fully annotate the chicken genome. The application of Nanopore sequencing in this study demonstrates the usefulness of long-read data in discovering additional novel loci (e.g., lncRNA loci) and resolving complex transcripts (e.g., the longest transcript for the TTN locus).
|
40
|
First Genome of Rock Lizard Darevskia valentini Involved in Formation of Several Parthenogenetic Species. Genes (Basel) 2022; 13:genes13091569. [PMID: 36140737 PMCID: PMC9498476 DOI: 10.3390/genes13091569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2022] [Revised: 08/22/2022] [Accepted: 08/27/2022] [Indexed: 11/22/2022] Open
Abstract
The extant reptiles are one of the most diverse clades among terrestrial vertebrates and one of a few groups with instances of parthenogenesis. Due to the hybrid origin of parthenogenetic species, reference genomes of the parental species as well as of the parthenogenetic progeny are indispensable to explore the genetic foundations of parthenogenetic reproduction. Here, we report on the first genome assembly of rock lizard Darevskia valentini, a paternal species for several parthenogenetic lineages. The novel genome was used in the reconstruction of the comprehensive phylogeny of Squamata inferred independently from 7369 trees of single-copy orthologs and a supermatrix of 378 conserved proteins. We also investigated Hox clusters, the loci that are often regarded as playing an important role in the speciation of animal groups with drastically diverse morphology. We demonstrated that Hox clusters of D. valentini are invaded with transposons and contain the HoxC1 gene that has been considered to be lost in the amniote ancestor. This study provides confirmation for previous works and releases new genomic data that will contribute to future discoveries on the mechanisms of parthenogenesis as well as support comparative studies among reptiles.
|
41
|
Multi-omics analyses reveal MdMYB10 hypermethylation being responsible for a bud sport of apple fruit color. HORTICULTURE RESEARCH 2022; 9:uhac179. [PMID: 36338840 PMCID: PMC9627520 DOI: 10.1093/hr/uhac179] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Accepted: 08/02/2022] [Indexed: 06/16/2023]
Abstract
Apple bud sports offer a rich resource for clonal selection of numerous elite cultivars. The accumulation of somatic mutations as plants develop may potentially impact the emergence of bud sports. Previous studies focused on somatic mutation in the essential genes associated with bud sports. However, the rate and function of genome-wide somatic mutations that accumulate when a bud sport arises remain unclear. In this study, we identified a branch from a 10-year-old tree of the apple cultivar 'Oregon Spur II' as a bud sport. The mutant branch showed reduced red coloration on fruit skin. Using this plant material, we assembled a high-quality haplotype reference genome consisting of 649.61 Mb sequences with a contig N50 value of 2.04 Mb. We then estimated the somatic mutation rate of the apple tree to be 4.56 × 10⁻⁸ per base per year, and further identified 253 somatic single-nucleotide polymorphisms (SNPs), including five non-synonymous SNPs, between the original type and mutant samples. Transcriptome analyses showed that 69 differentially expressed genes between the original type and mutant fruit skin were highly correlated with anthocyanin content. DNA methylation in the promoter of five anthocyanin-associated genes was increased in the mutant compared with the original type as determined using DNA methylation profiling. Among the genetic and epigenetic factors that directly and indirectly influence anthocyanin content in the mutant apple fruit skin, the hypermethylated promoter of MdMYB10 is important. This study indicated that numerous somatic mutations accumulated at the emergence of a bud sport from a genome-wide perspective, some of which contribute to the low coloration of the bud sport.
|
42
|
Novel Method of Full-Length RNA-seq That Expands the Identification of Non-Polyadenylated RNAs Using Nanopore Sequencing. Anal Chem 2022; 94:12342-12351. [PMID: 36018770 DOI: 10.1021/acs.analchem.2c01128] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/30/2022]
Abstract
The occurrence of disease is accompanied by transcriptome alterations affecting both coding and non-coding transcripts. Third-generation sequencing (TGS) technologies allow for intensive and comprehensive research of the transcriptome. However, the present standard TGS RNA sequencing method is unable to detect many of the non-polyadenylated [non-poly(A)] RNAs. To obtain more complete transcriptome information, we presented a new comprehensive sequencing approach that combines conventional poly(A) RNA sequencing with sequencing of the non-poly(A) RNA fraction, tailed with poly(U), in HepG2 and HL-7702 cell lines, enabling the detection of multiple categories of non-poly(A) RNAs excluded by the existing standard approach. Moreover, the full-splice match transcripts were longer than those assembled from short reads, which contributed to characterizing alternative splicing events and provided a comprehensive portrait of transcriptional complexity. Besides detecting genes with differential expression patterns in the poly(A) library between HepG2 and HL-7702, we also found a cancer-related non-coding gene in the poly(U) data, namely growth arrest-specific 5 (GAS5). Collectively, our results suggested that the novel method effectively captured both poly(A) and non-poly(A) transcripts in the tested cell lines and allowed a deeper exploration of the transcriptome.
|
43
|
Innovative Hybrid-Alignment Annotation Method for Bioinformatics Identification and Functional Verification of a Novel Nitric Oxide Synthase in Trichomonas vaginalis. BIOLOGY 2022; 11:biology11081210. [PMID: 36009837 PMCID: PMC9404748 DOI: 10.3390/biology11081210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2022] [Revised: 08/06/2022] [Accepted: 08/08/2022] [Indexed: 11/17/2022]
Abstract
Simple Summary: Both the annotation and identification of genes in pathogenic parasites remain challenging. As a survival factor, nitric oxide (NO) has been proven to be synthesized in Trichomonas vaginalis (TV). However, nitric oxide synthase (NOS) has not yet been annotated in the TV genome. By aligning whole coding sequences of TV against a thousand sequences of known proteins from other organisms via the Smith–Waterman and Needleman–Wunsch algorithms, we developed a witness-to-suspect strategy to identify incorrectly annotated genes in TV. A novel NOS of TV (TV NOS) with a high witness-to-suspect ratio, which was originally annotated as a hydrogenase in the NCBI database, was successfully identified. We then performed in silico modeling of the protein structure and the molecular docking of all cofactors (NADPH, tetrahydrobiopterin (BH4), heme and flavin adenine dinucleotide (FAD)), cloned the gene, expressed and purified the protein, and ultimately performed mass spectrometry analysis and enzymatic activity assays. We clearly showed that although the predicted structure of TV NOS is not similar to that of NOS proteins of other species, all cofactor-binding motifs can interact with their ligands with high affinities. Most importantly, the purified protein is a functional NOS, as it has a high enzymatic activity for generating NO in vitro. This study provides an innovative approach to identify incorrectly annotated genes.
Abstract: Both the annotation and identification of genes in pathogenic parasites are still challenging. Although, as a survival factor, nitric oxide (NO) has been proven to be synthesized in Trichomonas vaginalis (TV), nitric oxide synthase (NOS) has not yet been annotated in the TV genome. We developed a witness-to-suspect strategy to identify incorrectly annotated genes in TV via the Smith–Waterman and Needleman–Wunsch algorithms through in-depth and repeated alignment of whole coding sequences of TV against thousands of sequences of known proteins from other organisms. A novel NOS of TV (TV NOS), which was annotated as hydrogenase in the NCBI database, was successfully identified; this TV NOS had a high witness-to-suspect ratio and contained all the NOS cofactor-binding motifs (NADPH, tetrahydrobiopterin (BH4), heme and flavin adenine dinucleotide (FAD) motifs). To confirm this identification, we performed in silico modeling of the protein structure and cofactor docking, cloned the gene, expressed and purified the protein, performed mass spectrometry analysis, and ultimately performed an assay to measure enzymatic activity. Our data showed that although the predicted structure of the TV NOS protein was not similar to the structure of NOSs of other species, all cofactor-binding motifs could interact with their ligands with high affinities. We clearly showed that the purified protein had high enzymatic activity for generating NO in vitro. This study provides an innovative approach to identify incorrectly annotated genes in TV and highlights a novel NOS that might serve as a virulence factor of TV.
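The entry above names the Smith–Waterman algorithm but gives neither scoring parameters nor a definition of the witness-to-suspect ratio, so neither is reproduced here. As a generic illustration only, the following minimal Python sketch computes a Smith–Waterman local alignment score with a linear gap penalty; the function name and the match, mismatch, and gap values are assumptions chosen for the example, not values from the paper.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions (linear gap penalty).
    Only the score is returned; no traceback is performed.
    """
    cols = len(b) + 1
    prev = [0] * cols  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            # Extend the diagonal with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Identical 4-letter sequences: four matches at +2 each.
print(smith_waterman_score("ACGT", "ACGT"))  # prints 8
```

Needleman–Wunsch, the other algorithm named in the entry, differs mainly in dropping the zero floor and scoring the alignment end to end rather than taking the best-scoring local segment.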
|
44
|
Deep learning identifies and quantifies recombination hotspot determinants. Bioinformatics 2022; 38:2683-2691. [PMID: 35561158 PMCID: PMC9113300 DOI: 10.1093/bioinformatics/btac234] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/05/2021] [Revised: 03/08/2022] [Accepted: 04/08/2022] [Indexed: 11/30/2022] Open
Abstract
MOTIVATION Recombination is one of the essential genetic processes for sexually reproducing organisms, which can happen more frequently in some regions, called recombination hotspots. Although several factors, such as PRDM9 binding motifs, are known to be related to the hotspots, their contributions to the recombination hotspots have not been quantified, and other determinants are yet to be elucidated. Here, we propose a computational method, RHSNet, based on deep learning and signal processing, to identify and quantify the hotspot determinants in a purely data-driven manner, utilizing datasets from various studies, populations, sexes and species. RESULTS RHSNet can significantly outperform other sequence-based methods on multiple datasets across different species, sexes and studies. In addition to being able to identify hotspot regions and the well-known determinants accurately, more importantly, RHSNet can quantify the determinants that contribute significantly to the recombination hotspot formation in the relation between PRDM9 binding motif, histone modification and GC content. Further cross-sex, cross-population and cross-species studies suggest that the proposed method has the generalization power and potential to identify and quantify the evolutionary determinant motifs. AVAILABILITY AND IMPLEMENTATION https://github.com/frankchen121212/RHSNet. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
45
|
Mixing genome annotation methods in a comparative analysis inflates the apparent number of lineage-specific genes. Curr Biol 2022; 32:2632-2639.e2. [PMID: 35588743 DOI: 10.1016/j.cub.2022.04.085] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 01/04/2022] [Revised: 03/17/2022] [Accepted: 04/21/2022] [Indexed: 12/16/2022]
Abstract
Comparisons of genomes of different species are used to identify lineage-specific genes, those genes that appear unique to one species or clade. Lineage-specific genes are often thought to represent genetic novelty that underlies unique adaptations. Identification of these genes depends not only on genome sequences, but also on inferred gene annotations. Comparative analyses typically use available genomes that have been annotated using different methods, increasing the risk that orthologous DNA sequences may be erroneously annotated as a gene in one species but not another, appearing lineage specific as a result. To evaluate the impact of such "annotation heterogeneity," we identified four clades of species with sequenced genomes with more than one publicly available gene annotation, allowing us to compare the number of lineage-specific genes inferred when differing annotation methods are used to those resulting when annotation method is uniform across the clade. In these case studies, annotation heterogeneity increases the apparent number of lineage-specific genes by up to 15-fold, suggesting that annotation heterogeneity is a substantial source of potential artifact.
|
46
|
The Genome of the Marine Rotifer Brachionus manjavacas: Genome-Wide Identification of 310 G Protein-Coupled Receptor (GPCR) Genes. MARINE BIOTECHNOLOGY (NEW YORK, N.Y.) 2022; 24:226-242. [PMID: 35262805 DOI: 10.1007/s10126-022-10102-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/22/2021] [Accepted: 02/11/2022] [Indexed: 06/14/2023]
Abstract
The marine rotifer Brachionus manjavacas is widely used in ecological, ecotoxicological, and ecophysiological studies. The reference genome of B. manjavacas is a good starting point to uncover the potential molecular mechanisms of responses to various environmental stressors. In this study, we assembled the whole-genome sequence (114.1 Mb total, N50 = 6.36 Mb) of B. manjavacas, consisting of 61 contigs with 18,527 annotated genes. To elucidate the potential ligand-receptor signaling pathways in marine Brachionus rotifers in response to environmental signals, we identified 310 G protein-coupled receptor (GPCR) genes in the B. manjavacas genome after comparing them with three other species, including the minute rotifer Proales similis, Drosophila melanogaster, and humans (Homo sapiens). The 310 full-length GPCR genes were categorized into five distinct classes: A (262), B (26), C (7), F (2), and other (13). Most GPCR gene families showed sporadic evolutionary processes, but some classes were highly conserved between species as shown in the minute rotifer P. similis. Overall, these results provide potential clues for in silico analysis of GPCR-based signaling pathways in the marine rotifer B. manjavacas and will expand our knowledge of ligand-receptor signaling pathways in response to various environmental signals in rotifers.
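The N50 value quoted in the entry above is a standard assembly contiguity statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A minimal Python sketch follows, assuming made-up toy contig lengths rather than the B. manjavacas assembly; the function name `n50` is illustrative and does not come from the paper.

```python
def n50(lengths):
    """Return the N50 of an iterable of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half
        # of the assembly size defines the N50.
        if running >= half_total:
            return length


# Toy lengths (hypothetical): total is 300, and 80 + 70 reaches 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # prints 70
```

A larger N50 for the same total length means the assembly is split into fewer, longer pieces, which is why it is reported alongside the contig count.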
|
47
|
TreeTuner: A pipeline for minimizing redundancy and complexity in large phylogenetic datasets. STAR Protoc 2022; 3:101175. [PMID: 35243369 PMCID: PMC8857567 DOI: 10.1016/j.xpro.2022.101175] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022] Open
Abstract
Various bioinformatics protocols have been developed for trimming the number of operational taxonomic units (OTUs) in phylogenetic datasets, but they typically require significant manual intervention. Here we present TreeTuner, a semiautomated pipeline that allows both coarse and fine-scale tuning of large protein sequence phylogenetic datasets via the minimization of OTU redundancy. TreeTuner facilitates preliminary investigation of such datasets as well as more rigorous downstream analysis of specific subsets of OTUs. For complete details on the use and execution of this protocol, please refer to Maruyama et al. (2013) and Sibbald et al. (2019).
Highlights: Minimizes sequence redundancy in large phylogenetic datasets; trims thousands of operational taxonomic units (OTUs) from a preliminary tree; maintains the desired minimal taxonomic diversity and retains specific OTUs for analysis; coarse- and fine-tuning options are available depending on the phylogenetic question at hand.
|
48
|
The freshwater water flea Daphnia magna NIES strain genome as a resource for CRISPR/Cas9 gene targeting: The glutathione S-transferase omega 2 gene. AQUATIC TOXICOLOGY 2022; 242:106021. [PMID: 34856461 DOI: 10.1016/j.aquatox.2021.106021] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 09/09/2021] [Revised: 10/26/2021] [Accepted: 11/07/2021] [Indexed: 02/07/2023]
Abstract
The water flea Daphnia magna is a small freshwater planktonic animal in the Cladocera. In this study, we assembled the genome of the D. magna NIES strain, which is widely used for gene targeting but has no reported genome. We used the long-read sequenced data of the Oxford nanopore sequencing tool for assembly. Using 3,231 genetic markers, the draft genome of the D. magna NIES strain was built into ten linkage groups (LGs) with 483 unanchored contigs, comprising a genome size of 173.47 Mb. The N50 value of the genome was 12.54 Mb and the benchmarking universal single-copy ortholog value was 98.8%. Repeat elements in the D. magna NIES genome were 40.8%, which was larger than other Daphnia spp. In the D. magna NIES genome, 15,684 genes were functionally annotated. To assess the genome of the D. magna NIES strain for CRISPR/Cas9 gene targeting, we selected glutathione S-transferase omega 2 (GST-O2), which is an important gene for the biotransformation of arsenic in aquatic organisms, and targeted it with an efficient make-up (25.0%) of mutant lines. In addition, we measured reactive oxygen species and antioxidant enzymatic activity between wild type and a mutant of the GST-O2 targeted D. magna NIES strain in response to arsenic. In this study, we present the genome of the D. magna NIES strain using GST-O2 as an example of gene targeting, which will contribute to the construction of deletion mutants by CRISPR/Cas9 technology.
|
49
|
Production of mannosylerythritol lipids: biosynthesis, multi-omics approaches, and commercial exploitation. Mol Omics 2022; 18:699-715. [DOI: 10.1039/d2mo00150k] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
Compilation of resources regarding MEL biosynthesis, key production parameters; available omics resources and current commercial applications, for smut fungi known to produce MELs.
|
50
|
Annotation of Protein-Coding Genes in Plant Genomes. Methods Mol Biol 2022; 2443:309-326. [PMID: 35037214 DOI: 10.1007/978-1-0716-2067-0_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Advances in next-generation sequencing technologies and the lower sequencing costs are paving the way to more plant genome sequencing, assembly, and annotation projects. While genome assembly is the first step toward elucidating the genome structure of a species, it is the annotation of the protein-coding genes that provide meaningful information to biologists. However, genome annotation is not a trivial task. Therefore, the aim of this chapter is to provide a detailed view of this important process, including tools and commands that can be used to carry out such a process.
|