1
Wong TKF, Cherryh C, Rodrigo AG, Hahn MW, Minh BQ, Lanfear R. MAST: Phylogenetic Inference with Mixtures Across Sites and Trees. Syst Biol 2024:syae008. [PMID: 38421146 DOI: 10.1093/sysbio/syae008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2022] [Indexed: 03/02/2024] Open
Abstract
Hundreds or thousands of loci are now routinely used in modern phylogenomic studies. Concatenation approaches to tree inference assume that there is a single topology for the entire dataset, but different loci may have different evolutionary histories due to incomplete lineage sorting, introgression, and/or horizontal gene transfer; even single loci may not be treelike due to recombination. To overcome this shortcoming, we introduce an implementation of a multi-tree mixture model that we call MAST. This model extends a prior implementation by Boussau et al. (2009) by allowing users to estimate the weight of each of a set of pre-specified bifurcating trees in a single alignment. The MAST model allows each tree to have its own weight, topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites. We implemented the MAST model in a maximum-likelihood framework in the popular phylogenetic software, IQ-TREE. Simulations show that we can accurately recover the true model parameters, including branch lengths and tree weights for a given set of tree topologies, under a wide range of biologically realistic scenarios. We also show that we can use standard statistical inference approaches to reject a single-tree model when data are simulated under multiple trees (and vice versa). We applied the MAST model to multiple primate datasets and found that it can recover the signal of incomplete lineage sorting in the Great Apes, as well as the asymmetry in minor trees caused by introgression among several macaque species. When applied to a dataset of four Platyrrhine species for which standard concatenated maximum likelihood and gene tree approaches disagree, we observe that MAST gives the highest weight (i.e. the largest proportion of sites) to the tree also supported by gene tree approaches. 
These results suggest that the MAST model is able to analyse a concatenated alignment using maximum likelihood, while avoiding some of the biases that come with assuming there is only a single tree. We discuss how the MAST model can be extended in the future.
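The mixture idea in this abstract can be illustrated with a small sketch. Under a multi-tree mixture, the likelihood of each site is a weighted sum of its likelihood under each candidate tree, and a tree's weight can be read as the proportion of sites it explains. The code below is only a minimal illustration under stated assumptions, not the IQ-TREE implementation: it assumes the per-site likelihoods under each tree are already available (the `site_liks` values are made up), it estimates the weights alone with a plain EM update, and the function names are hypothetical.

```python
import math

def mixture_log_likelihood(site_liks, weights):
    """Log-likelihood of an alignment under a tree mixture.

    site_liks[i][j] is the likelihood of site i under tree j
    (assumed to be precomputed elsewhere).
    """
    return sum(
        math.log(sum(w * lik for w, lik in zip(weights, row)))
        for row in site_liks
    )

def em_tree_weights(site_liks, n_iter=200):
    """Estimate tree weights by EM, holding per-site likelihoods fixed."""
    n_trees = len(site_liks[0])
    weights = [1.0 / n_trees] * n_trees
    for _ in range(n_iter):
        totals = [0.0] * n_trees
        for row in site_liks:
            # E-step: posterior probability that this site belongs to each tree
            joint = [w * lik for w, lik in zip(weights, row)]
            denom = sum(joint)
            for j, value in enumerate(joint):
                totals[j] += value / denom
        # M-step: each weight becomes the mean posterior across sites
        weights = [t / len(site_liks) for t in totals]
    return weights

# Illustrative values only: three sites favour tree 0, one site favours tree 1.
site_liks = [
    [0.020, 0.001],
    [0.015, 0.002],
    [0.030, 0.001],
    [0.001, 0.025],
]
weights = em_tree_weights(site_liks)
print(weights, mixture_log_likelihood(site_liks, weights))
```

In a full analysis the per-site likelihoods would themselves depend on branch lengths and substitution parameters that are optimised together with the weights; the sketch fixes them to keep the weighting step visible.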
Affiliation(s)
- Thomas K F Wong
  - School of Computing, Australian National University, Canberra, Australian Capital Territory 2601, Australia
- Caitlin Cherryh
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
- Allen G Rodrigo
  - School of Biological Sciences, University of Auckland, Auckland, New Zealand
- Matthew W Hahn
  - Department of Biology and Department of Computer Science, Indiana University, Bloomington, Indiana, United States of America
- Bui Quang Minh
  - School of Computing, Australian National University, Canberra, Australian Capital Territory 2601, Australia
- Robert Lanfear
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia

2
Li X, Zou Y, Li T, Wong TKF, Bushey RT, Campa MJ, Gottlin EB, Liu H, Wei Q, Rodrigo A, Patz EF. Genetic Variants of CLPP and M1AP Are Associated With Risk of Non-Small Cell Lung Cancer. Front Oncol 2021; 11:709829. [PMID: 34604049 PMCID: PMC8479179 DOI: 10.3389/fonc.2021.709829] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2021] [Accepted: 08/20/2021] [Indexed: 11/23/2022] Open
Abstract
Background Single nucleotide polymorphisms (SNPs) are often associated with distinct phenotypes in cancer. The present study investigated associations of cancer risk and outcomes with SNPs discovered by whole exome sequencing of normal lung tissue DNA of 15 non-small cell lung cancer (NSCLC) patients, 10 early stage and 5 advanced stage. Methods DNA extracted from normal lung tissue of the 15 NSCLC patients was subjected to whole genome amplification and sequencing and analyzed for the occurrence of SNPs. The association of SNPs with the risk of lung cancer and survival was surveyed using the OncoArray study dataset of 85,716 patients (29,266 cases and 56,450 cancer-free controls) and the Prostate, Lung, Colorectal and Ovarian study subset of 1,175 lung cancer patients. Results We identified 4 SNPs exclusive to the 5 patients with advanced stage NSCLC: rs10420388 and rs10418574 in the CLPP gene, and rs11126435 and rs2021725 in the M1AP gene. The variant alleles G of SNP rs10420388 and A of SNP rs10418574 in the CLPP gene were associated with increased risk of squamous cell carcinoma (OR = 1.07 and 1.07; P = 0.013 and 0.016, respectively). The variant allele T of SNP rs11126435 in the M1AP gene was associated with decreased risk of adenocarcinoma (OR = 0.95; P = 0.027). There was no significant association of these SNPs with the overall survival of lung cancer patients (P > 0.05). Conclusions SNPs identified in the CLPP and M1AP genes may be useful in risk prediction models for lung cancer. The previously established association of the CLPP gene with cancer progression lends relevance to our findings.
Affiliation(s)
- Xianghan Li
  - Research School of Biology, Australian National University, Canberra, ACT, Australia
  - School of Biological Sciences, University of Auckland, Auckland, New Zealand
- Yiran Zou
  - Research School of Biology, Australian National University, Canberra, ACT, Australia
  - School of Biological Sciences, University of Auckland, Auckland, New Zealand
- Teng Li
  - Research School of Biology, Australian National University, Canberra, ACT, Australia
  - School of Biological Sciences, University of Auckland, Auckland, New Zealand
- Thomas K F Wong
  - Research School of Biology, Australian National University, Canberra, ACT, Australia
- Ryan T Bushey
  - Department of Radiology, Duke University Medical Center, Durham, NC, United States
- Michael J Campa
  - Department of Radiology, Duke University Medical Center, Durham, NC, United States
- Elizabeth B Gottlin
  - Department of Radiology, Duke University Medical Center, Durham, NC, United States
- Hongliang Liu
  - Duke Cancer Institute, Duke University Medical Center, Durham, NC, United States
  - Department of Population Health Sciences, Duke University School of Medicine, Durham, NC, United States
- Qingyi Wei
  - Duke Cancer Institute, Duke University Medical Center, Durham, NC, United States
  - Department of Population Health Sciences, Duke University School of Medicine, Durham, NC, United States
  - Department of Medicine, Duke University School of Medicine, Durham, NC, United States
- Allen Rodrigo
  - Research School of Biology, Australian National University, Canberra, ACT, Australia
  - School of Biological Sciences, University of Auckland, Auckland, New Zealand
- Edward F Patz
  - Department of Radiology, Duke University Medical Center, Durham, NC, United States
  - Duke Cancer Institute, Duke University Medical Center, Durham, NC, United States
  - Department of Pharmacology and Cancer Biology, Duke University Medical Center, Durham, NC, United States

3
Li T, Wong TKF, Ranjard L, Rodrigo AG. pgHMA: Application of the heteroduplex mobility assay analysis in phylogenetics and population genetics. Mol Ecol Resour 2021; 22:653-663. [PMID: 34551204 DOI: 10.1111/1755-0998.13508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2021] [Revised: 09/01/2021] [Accepted: 09/06/2021] [Indexed: 11/26/2022]
Abstract
The heteroduplex mobility assay (HMA) has proven to be a robust tool for the detection of genetic variation. Here, we describe a simple and rapid application of the HMA by microfluidic capillary electrophoresis, for phylogenetics and population genetic analyses (pgHMA). We show how commonly applied techniques in phylogenetics and population genetics have equivalents with pgHMA: phylogenetic reconstruction with bootstrapping, skyline plots, and mismatch distribution analysis. We assess the performance and accuracy of pgHMA by comparing the results obtained against those obtained using standard methods of analyses applied to sequencing data. The resulting comparisons demonstrate that: (a) there is a significant linear relationship (R² = .992) between heteroduplex mobility and genetic distance, (b) phylogenetic trees obtained by HMA and nucleotide sequences present nearly identical topologies, (c) clades with high pgHMA parametric bootstrap support also have high bootstrap support on nucleotide phylogenies, (d) skyline plots estimated from the UPGMA trees of HMA and Bayesian trees of nucleotide data reveal similar trends, especially for the median trend estimate of effective population size, and (e) optimized mismatch distributions of HMA are closely fitted to the mismatch distributions of nucleotide sequences. In summary, pgHMA is an easily-applied method for approximating phylogenetic diversity and population trends.
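Point (a) of this abstract describes a linear calibration between heteroduplex mobility and genetic distance. The following is a rough sketch of what such a calibration could look like, using ordinary least squares in plain Python. The measurements are invented for illustration, `mobility_shift` is a hypothetical name for whatever mobility measure the assay reports, and nothing here reproduces the authors' actual procedure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination for the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical calibration pairs: mobility measure vs. sequence-based distance.
mobility_shift = [0.10, 0.20, 0.30, 0.40, 0.50]
genetic_distance = [0.011, 0.019, 0.031, 0.040, 0.049]

slope, intercept = fit_line(mobility_shift, genetic_distance)
r2 = r_squared(mobility_shift, genetic_distance, slope, intercept)

# Predict a distance for a new, unsequenced pair from its mobility measure.
predicted = slope * 0.35 + intercept
print(slope, intercept, r2, predicted)
```

Distances predicted this way could then feed a distance-based tree method such as UPGMA, which the abstract mentions for the skyline plots.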
Affiliation(s)
- Teng Li
  - Research School of Biology, Australian National University, Canberra, ACT, Australia
  - School of Biological Sciences, University of Auckland, Auckland, New Zealand
- Thomas K F Wong
  - Research School of Biology, Australian National University, Canberra, ACT, Australia
- Louis Ranjard
  - Research School of Biology, Australian National University, Canberra, ACT, Australia
  - PlantTech Research Institute, Tauranga, New Zealand
- Allen G Rodrigo
  - Research School of Biology, Australian National University, Canberra, ACT, Australia
  - School of Biological Sciences, University of Auckland, Auckland, New Zealand

4
Wong TKF, Li T, Ranjard L, Wu SH, Sukumaran J, Rodrigo AG. An assembly-free method of phylogeny reconstruction using short-read sequences from pooled samples without barcodes. PLoS Comput Biol 2021; 17:e1008949. [PMID: 34516547 PMCID: PMC8460051 DOI: 10.1371/journal.pcbi.1008949] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2021] [Revised: 09/23/2021] [Accepted: 09/01/2021] [Indexed: 12/01/2022] Open
Abstract
A current strategy for obtaining haplotype information from several individuals involves short-read sequencing of pooled amplicons, where fragments from each individual are identified by a unique DNA barcode. In this paper, we report a new method to recover the phylogeny of haplotypes from short-read sequences obtained using pooled amplicons from a mixture of individuals, without barcoding. The method, AFPhyloMix, accepts an alignment of the mixture of reads against a reference sequence, obtains the single-nucleotide-polymorphism (SNP) patterns along the alignment, and constructs the phylogenetic tree according to the SNP patterns. AFPhyloMix adopts a Bayesian inference model to estimate the phylogeny of the haplotypes and their relative abundances, given that the number of haplotypes is known. In our simulations, AFPhyloMix achieved at least 80% accuracy at recovering the phylogenies and relative abundances of the constituent haplotypes, for mixtures with up to 15 haplotypes. AFPhyloMix also worked well on a real data set of kangaroo mitochondrial DNA sequences.
In evolutionary studies, it is customary to obtain homologous sequences from different individuals in a population or a species to construct a phylogeny. Frequently, sequences from different individuals will be identical; we refer to a set of identical sequences as a haplotype. If short-read sequencing technologies are used to obtain sequences from many individuals, the sequence from each individual is tagged with a unique barcode, and a mixed sample of tagged sequences is subsequently sequenced. The tagged sequences can be identified using the appropriate bioinformatics tools, for further downstream analyses. We have developed a novel method, AFPhyloMix, to reconstruct the phylogeny of a mixed sample of homologous sequences, and the relative abundance of different haplotypes, from different individuals without the need for barcoding. AFPhyloMix aligns the short reads obtained to a reference alignment, and identifies the variable sites along the alignment. On the basis of the patterns of nucleotide frequencies at these and neighbouring sites, AFPhyloMix uses a Bayesian inference model to compute the phylogenetic tree and the haplotype relative abundances. Our results show that AFPhyloMix works well on both the simulated data set and the real data set.
Affiliation(s)
- Thomas K. F. Wong
  - The Research School of Biology, The Australian National University, ACT, Australia
  - * E-mail: (TKFW); (AGR)
- Teng Li
  - The Research School of Biology, The Australian National University, ACT, Australia
  - School of Biological Sciences, University of Auckland, Auckland, New Zealand
- Louis Ranjard
  - The Research School of Biology, The Australian National University, ACT, Australia
  - PlantTech Research Institute, Tauranga, New Zealand
- Steven H. Wu
  - Department of Agronomy, National Taiwan University, Taipei, Taiwan
- Jeet Sukumaran
  - Biology Department, San Diego State University, San Diego, California, United States of America
- Allen G. Rodrigo
  - The Research School of Biology, The Australian National University, ACT, Australia
  - School of Biological Sciences, University of Auckland, Auckland, New Zealand
  - * E-mail: (TKFW); (AGR)

5
Wong TKF, Kalyaanamoorthy S, Meusemann K, Yeates DK, Misof B, Jermiin LS. A minimum reporting standard for multiple sequence alignments. NAR Genom Bioinform 2020; 2:lqaa024. [PMID: 33575581 PMCID: PMC7671350 DOI: 10.1093/nargab/lqaa024] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 01/25/2020] [Revised: 03/12/2020] [Accepted: 03/30/2020] [Indexed: 12/19/2022] Open
Abstract
Multiple sequence alignments (MSAs) play a pivotal role in studies of molecular sequence data, but nobody has developed a minimum reporting standard (MRS) to quantify the completeness of MSAs in terms of completely specified nucleotides or amino acids. We present an MRS that relies on four simple completeness metrics. The metrics are implemented in AliStat, a program developed to support the MRS. A survey of published MSAs illustrates the benefits and unprecedented transparency offered by the MRS.
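To make the notion of alignment completeness concrete, here is a minimal sketch of how proportions of completely specified characters might be computed. The abstract does not define the four metrics, so the sketch assumes they are proportions taken over the whole alignment, per sequence, per site, and per pair of sequences. It also assumes nucleotide data in which `-`, `?` and `N` count as incompletely specified. It is not AliStat's code, and all names are hypothetical.

```python
from itertools import combinations

# Assumption: these symbols mark incompletely specified nucleotides.
INCOMPLETE = set("-?N")

def is_complete(ch):
    return ch.upper() not in INCOMPLETE

def completeness_metrics(alignment):
    """alignment: dict mapping sequence name -> aligned sequence (equal lengths)."""
    names = list(alignment)
    length = len(alignment[names[0]])
    mask = {n: [is_complete(c) for c in alignment[n]] for n in names}

    # Proportion of completely specified characters in the whole alignment
    overall = sum(sum(mask[n]) for n in names) / (len(names) * length)
    # Same proportion for each sequence (row)
    per_seq = {n: sum(mask[n]) / length for n in names}
    # Same proportion for each site (column)
    per_site = [sum(mask[n][k] for n in names) / len(names) for k in range(length)]
    # Proportion of sites where both sequences of a pair are completely specified
    per_pair = {
        (a, b): sum(x and y for x, y in zip(mask[a], mask[b])) / length
        for a, b in combinations(names, 2)
    }
    return overall, per_seq, per_site, per_pair

toy = {"s1": "ACGT-A", "s2": "AC--NA", "s3": "ACGTTA"}
overall, per_seq, per_site, per_pair = completeness_metrics(toy)
print(overall, per_seq, per_site, per_pair)
```

The pairwise values are the ones most relevant to distance-based analyses, since a pair of sequences with little overlap of completely specified sites gives a poorly supported distance.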
Affiliation(s)
- Thomas K F Wong
  - Land & Water, CSIRO, Canberra, ACT 2601, Australia
  - Research School of Biology, Australian National University, Canberra, ACT 2600, Australia
- Subha Kalyaanamoorthy
  - Land & Water, CSIRO, Canberra, ACT 2601, Australia
  - Department of Chemistry, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Karen Meusemann
  - Australian National Insect Collection, CSIRO National Research Collections Australia, Canberra, ACT 2601, Australia
  - Zoologisches Forschungsmuseum Alexander Koenig, 53113 Bonn, Germany
  - Evolutionsbiologie & Ökologie, Institut für Biologie I, Albert-Ludwigs-Universität Freiburg, 79085 Freiburg im Breisgau, Germany
- David K Yeates
  - Australian National Insect Collection, CSIRO National Research Collections Australia, Canberra, ACT 2601, Australia
- Bernhard Misof
  - Zoologisches Forschungsmuseum Alexander Koenig, 53113 Bonn, Germany
- Lars S Jermiin
  - Land & Water, CSIRO, Canberra, ACT 2601, Australia
  - Research School of Biology, Australian National University, Canberra, ACT 2600, Australia
  - School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland
  - Earth Institute, University College Dublin, Belfield, Dublin 4, Ireland
  - To whom correspondence should be addressed.

6
Ranjard L, Wong TKF, Rodrigo AG. Correction to: Effective machine-learning assembly for next-generation amplicon sequencing with very low coverage. BMC Bioinformatics 2020; 21:24. [PMID: 31969110 PMCID: PMC6977291 DOI: 10.1186/s12859-019-3318-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/30/2023] Open
Affiliation(s)
- Louis Ranjard
  - The Research School of Biology, The Australian National University, Canberra, Australia
- Thomas K F Wong
  - The Research School of Biology, The Australian National University, Canberra, Australia
- Allen G Rodrigo
  - The Research School of Biology, The Australian National University, Canberra, Australia

7
Ranjard L, Wong TKF, Rodrigo AG. Effective machine-learning assembly for next-generation amplicon sequencing with very low coverage. BMC Bioinformatics 2019; 20:654. [PMID: 31829137 PMCID: PMC6907241 DOI: 10.1186/s12859-019-3287-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/03/2019] [Accepted: 11/20/2019] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND In short-read DNA sequencing experiments, the read coverage is a key parameter to successfully assemble the reads and reconstruct the sequence of the input DNA. When coverage is very low, the original sequence reconstruction from the reads can be difficult because of the occurrence of uncovered gaps. Reference-guided assembly can then improve these assemblies. However, when the available reference is phylogenetically distant from the sequencing reads, the mapping rate of the reads can be extremely low. Some recent improvements in read mapping approaches aim at modifying the reference according to the reads dynamically. Such approaches can significantly improve the alignment rate of the reads onto distant references but the processing of insertions and deletions remains challenging. RESULTS Here, we introduce a new algorithm to update the reference sequence according to previously aligned reads. Substitutions, insertions and deletions are performed in the reference sequence dynamically. We evaluate this approach to assemble a western-grey kangaroo mitochondrial amplicon. Our results show that more reads can be aligned and that this method produces assemblies of length comparable to the truth while limiting error rate when classic approaches fail to recover the correct length. Finally, we discuss how the core algorithm of this method could be improved and combined with other approaches to analyse larger genomic sequences. CONCLUSIONS We introduced an algorithm to perform dynamic alignment of reads on a distant reference. We showed that such an approach can improve the reconstruction of an amplicon compared to classically used bioinformatic pipelines. Although not portable to genomic scale in the current form, we suggested several improvements to be investigated to make this method more flexible and allow dynamic alignment to be used for large genome assemblies.
Affiliation(s)
- Louis Ranjard
  - The Research School of Biology, The Australian National University, Canberra, Australia
- Thomas K. F. Wong
  - The Research School of Biology, The Australian National University, Canberra, Australia
- Allen G. Rodrigo
  - The Research School of Biology, The Australian National University, Canberra, Australia

8
Wong TKF, Ranjard L, Lin Y, Rodrigo AG. HaploJuice: accurate haplotype assembly from a pool of sequences with known relative concentrations. BMC Bioinformatics 2018; 19:389. [PMID: 30348075 PMCID: PMC6198429 DOI: 10.1186/s12859-018-2424-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 04/25/2018] [Accepted: 10/09/2018] [Indexed: 11/10/2022] Open
Abstract
Background Pooling techniques, where multiple sub-samples are mixed in a single sample, are widely used to take full advantage of high-throughput DNA sequencing. Recently, Ranjard et al. (PLoS ONE 13:0195090, 2018) proposed a pooling strategy without the use of barcodes. Three sub-samples were mixed in different known proportions (i.e. 62.5%, 25% and 12.5%), and a method was developed to use these proportions to reconstruct the three haplotypes effectively. Results HaploJuice provides an alternative haplotype reconstruction algorithm for Ranjard et al.’s pooling strategy. HaploJuice significantly increases the accuracy by first identifying the empirical proportions of the three mixed sub-samples and then assembling the haplotypes using a dynamic programming approach. HaploJuice was evaluated against five different assembly algorithms, Hmmfreq (Ranjard et al., PLoS ONE 13:0195090, 2018), ShoRAH (Zagordi et al., BMC Bioinformatics 12:119, 2011), SAVAGE (Baaijens et al., Genome Res 27:835-848, 2017), PredictHaplo (Prabhakaran et al., IEEE/ACM Trans Comput Biol Bioinform 11:182-91, 2014) and QuRe (Prosperi and Salemi, Bioinformatics 28:132-3, 2012). Using simulated and real data sets, HaploJuice reconstructed the true sequences with the highest coverage and the lowest error rate. Conclusion HaploJuice provides high accuracy in haplotype reconstruction, making Ranjard et al.’s pooling strategy more efficient, feasible, and applicable, with the benefit of reducing the sequencing cost.
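One way to see why mixing sub-samples in the proportions quoted above can replace barcodes: with 62.5%, 25% and 12.5%, every subset of the three haplotypes has a distinct expected allele frequency, so the observed frequency of an allele at a variable site hints at which haplotypes carry it. The sketch below only illustrates that idea under idealised, noise-free expectations. It is not the HaploJuice algorithm, which according to the abstract first estimates the empirical proportions and then assembles haplotypes by dynamic programming. Function names are hypothetical.

```python
from itertools import combinations

PROPORTIONS = (0.625, 0.25, 0.125)  # mixing proportions quoted in the abstract

def subset_frequencies(proportions):
    """Map each subset of haplotype indices to its expected allele frequency."""
    indices = range(len(proportions))
    table = {}
    for size in range(len(proportions) + 1):
        for subset in combinations(indices, size):
            table[subset] = sum(proportions[i] for i in subset)
    return table

def assign_allele(observed_freq, proportions=PROPORTIONS):
    """Return the haplotype subset whose expected frequency is closest to the observation."""
    table = subset_frequencies(proportions)
    return min(table, key=lambda s: abs(table[s] - observed_freq))

# An allele seen in about 36% of reads is best explained by haplotypes 1 and 2
# (expected frequency 0.25 + 0.125 = 0.375).
carriers = assign_allele(0.36)
print(carriers)
```

Real coverage deviates from the nominal proportions, which is presumably why estimating the empirical proportions first, as the abstract describes, improves accuracy.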
Affiliation(s)
- Thomas K F Wong
  - The Research School of Biology, The Australian National University, Acton ACT, 2601, Australia
- Louis Ranjard
  - The Research School of Biology, The Australian National University, Acton ACT, 2601, Australia
- Yu Lin
  - College of Engineering and Computer Science, The Australian National University, Acton ACT, 2601, Australia
- Allen G Rodrigo
  - The Research School of Biology, The Australian National University, Acton ACT, 2601, Australia

9
Ranjard L, Wong TKF, Rodrigo AG. Reassembling haplotypes in a mixture of pooled amplicons when the relative concentrations are known: A proof-of-concept study on the efficient design of next-generation sequencing strategies. PLoS One 2018; 13:e0195090. [PMID: 29621260 PMCID: PMC5886459 DOI: 10.1371/journal.pone.0195090] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/17/2017] [Accepted: 03/18/2018] [Indexed: 12/02/2022] Open
Abstract
Next-generation sequencing can be costly and labour intensive. Usually, the sequencing cost per sample is reduced by pooling amplified DNA (amplicons) derived from different individuals on the same sequencing lane. Barcodes unique to each amplicon permit short-read sequences to be assigned appropriately. However, the cost of the library preparation increases with the number of barcodes used. We propose an alternative to barcoding: by using different known proportions of individually-derived amplicons in a pooled sample, each is characterised a priori by an expected depth of coverage. We have developed a Hidden Markov Model that uses these expected proportions to reconstruct the input sequences. We apply this method to pools of mitochondrial DNA amplicons extracted from kangaroo meat, genus Macropus. Our experiments indicate that the sequence coverage can be efficiently used to index the short-reads and that we can reassemble the input haplotypes when secondary factors impacting the coverage are controlled. We therefore demonstrate that, by combining our approach with standard barcoding, the cost of the library preparation is reduced to a third.
Affiliation(s)
- Louis Ranjard
  - The Research School of Biology, The Australian National University, Australia
- Thomas K. F. Wong
  - The Research School of Biology, The Australian National University, Australia
- Allen G. Rodrigo
  - The Research School of Biology, The Australian National University, Australia

10
Ranjard L, Wong TKF, Külheim C, Rodrigo AG, Ragg NLC, Patel S, Dunphy BJ. Complete mitochondrial genome of the green-lipped mussel, Perna canaliculus (Mollusca: Mytiloidea), from long nanopore sequencing reads. Mitochondrial DNA B Resour 2018; 3:175-176. [PMID: 33490494 PMCID: PMC7801018 DOI: 10.1080/23802359.2018.1437810] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/09/2022]
Abstract
We describe here the first complete assembly of the mitochondrial genome of the New Zealand green-lipped mussel, Perna canaliculus. The assembly was performed de novo from a mix of long nanopore sequencing reads and short sequencing reads. The genome is 16,005 bp long. Comparison to other Mytiloidea mitochondrial genomes indicates important gene rearrangements in this family.
Affiliation(s)
- Louis Ranjard
  - Research School of Biology, ANU College of Science, The Australian National University, Canberra, Australia
- Thomas K F Wong
  - Research School of Biology, ANU College of Science, The Australian National University, Canberra, Australia
- Carsten Külheim
  - Research School of Biology, ANU College of Science, The Australian National University, Canberra, Australia
- Allen G Rodrigo
  - Research School of Biology, ANU College of Science, The Australian National University, Canberra, Australia
- Selina Patel
  - School of Biological Sciences, University of Auckland, Auckland, New Zealand
- Brendon J Dunphy
  - School of Biological Sciences, University of Auckland, Auckland, New Zealand

11
Tay WT, Walsh TK, Downes S, Anderson C, Jermiin LS, Wong TKF, Piper MC, Chang ES, Macedo IB, Czepak C, Behere GT, Silvie P, Soria MF, Frayssinet M, Gordon KHJ. Mitochondrial DNA and trade data support multiple origins of Helicoverpa armigera (Lepidoptera, Noctuidae) in Brazil. Sci Rep 2017; 7:45302. [PMID: 28350004 PMCID: PMC5368605 DOI: 10.1038/srep45302] [Citation(s) in RCA: 52] [Impact Index Per Article: 7.4] [Received: 10/20/2016] [Accepted: 02/23/2017] [Indexed: 01/31/2023] Open
Abstract
The Old World bollworm Helicoverpa armigera is now established in Brazil but efforts to identify incursion origin(s) and pathway(s) have met with limited success due to the patchiness of available data. Using international agricultural/horticultural commodity trade data and mitochondrial DNA (mtDNA) cytochrome oxidase I (COI) and cytochrome b (Cyt b) gene markers, we inferred the origins and incursion pathways into Brazil. We detected 20 mtDNA haplotypes from six Brazilian states, eight of which were new to our 97 global COI-Cyt b haplotype database. Direct sequence matches indicated five Brazilian haplotypes had Asian, African, and European origins. We identified 45 parsimoniously informative sites and multiple substitutions per site within the concatenated (945 bp) nucleotide dataset, implying that probabilistic phylogenetic analysis methods are needed. High diversity and signatures of uniquely shared haplotypes with diverse localities combined with the trade data suggested multiple incursions and introduction origins in Brazil. Increasing agricultural/horticultural trade activities between the Old and New Worlds represents a significant biosecurity risk factor. Identifying pest origins will enable resistance profiling that reflects countries of origin to be included when developing a resistance management strategy, while identifying incursion pathways will improve biosecurity protocols and risk analysis at biosecurity hotspots including national ports.
Affiliation(s)
- Wee Tek Tay
  - CSIRO, Black Mountain Laboratories, Clunies Ross Street, ACT 2601, Australia
- Thomas K. Walsh
  - CSIRO, Black Mountain Laboratories, Clunies Ross Street, ACT 2601, Australia
- Sharon Downes
  - CSIRO, Myall Vale Laboratories, Kamilaroi Highway, Narrabri, NSW 2390, Australia
- Craig Anderson
  - CSIRO, Black Mountain Laboratories, Clunies Ross Street, ACT 2601, Australia
  - Biological and Environmental Sciences, University of Stirling, Stirling, FK9 4LA, UK
- Lars S. Jermiin
  - CSIRO, Black Mountain Laboratories, Clunies Ross Street, ACT 2601, Australia
  - Research School of Biology, Australian National University, Acton, ACT 2601, Australia
- Thomas K. F. Wong
  - CSIRO, Black Mountain Laboratories, Clunies Ross Street, ACT 2601, Australia
  - Research School of Biology, Australian National University, Acton, ACT 2601, Australia
- Melissa C. Piper
  - CSIRO, Black Mountain Laboratories, Clunies Ross Street, ACT 2601, Australia
- Ester Silva Chang
  - CSIRO, Black Mountain Laboratories, Clunies Ross Street, ACT 2601, Australia
  - Universidade de São Paulo, Instituto de Biociências, São Paulo, SP, 05508-090, Brazil
- Isabella Barony Macedo
  - CSIRO, Black Mountain Laboratories, Clunies Ross Street, ACT 2601, Australia
  - Universidade Federal de Minas Gerais, Faculdade de Farmácia, Belo Horizonte, MG, 31270-901, Brazil
- Cecilia Czepak
  - Universidade Federal de Goiás, Escola de Agronomia, Goiânia, GO, 75804-020, Brazil
- Gajanan T. Behere
  - Division of Crop Protection, ICAR Research Complex for North East Hill Region, Umroi Road, Umiam, Meghalaya, 793103, India
- Pierre Silvie
  - IRD, UMR EGCE, FR-91198 Gif-sur-Yvette Cedex, France
  - CIRAD, UPR AÏDA, F-34398 Montpellier Cedex 05, France
- Miguel F. Soria
  - Bayer S.A., Crop Science Division, São Paulo, SP, 04779-900, Brazil
- Karl H. J. Gordon
  - CSIRO, Black Mountain Laboratories, Clunies Ross Street, ACT 2601, Australia

12
Jayaswal V, Wong TKF, Robinson J, Poladian L, Jermiin LS. Mixture models of nucleotide sequence evolution that account for heterogeneity in the substitution process across sites and across lineages. Syst Biol 2014; 63:726-42. [PMID: 24927722 DOI: 10.1093/sysbio/syu036] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.1] [Indexed: 11/14/2022] Open
Abstract
Molecular phylogenetic studies of homologous sequences of nucleotides often assume that the underlying evolutionary process was globally stationary, reversible, and homogeneous (SRH), and that a model of evolution with one or more site-specific and time-reversible rate matrices (e.g., the GTR rate matrix) is enough to accurately model the evolution of data over the whole tree. However, an increasing body of data suggests that evolution under these conditions is an exception, rather than the norm. To address this issue, several non-SRH models of molecular evolution have been proposed, but they either ignore heterogeneity in the substitution process across sites (HAS) or assume it can be modeled accurately using the Γ distribution. As an alternative to these models of evolution, we introduce a family of mixture models that approximate HAS without the assumption of an underlying predefined statistical distribution. This family of mixture models is combined with non-SRH models of evolution that account for heterogeneity in the substitution process across lineages (HAL). We also present two algorithms for searching model space and identifying an optimal model of evolution that is less likely to over- or underparameterize the data. The performance of the two new algorithms was evaluated using alignments of nucleotides with 10 000 sites simulated under complex non-SRH conditions on a 25-tipped tree. The algorithms were found to be very successful, identifying the correct HAL model with a 75% success rate (the average success rate for assigning rate matrices to the tree's 48 edges was 99.25%) and, for the correct HAL model, identifying the correct HAS model with a 98% success rate. Finally, parameter estimates obtained under the correct HAL-HAS model were found to be accurate and precise.
The merits of our new algorithms were illustrated with an analysis of 42 337 second codon sites extracted from a concatenation of 106 alignments of orthologous genes encoded by the nuclear genomes of Saccharomyces cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, S. castellii, S. kluyveri, S. bayanus, and Candida albicans. Our results show that second codon sites in the ancestral genome of these species contained 49.1% invariable sites, 39.6% variable sites belonging to one rate category (V1), and 11.3% variable sites belonging to a second rate category (V2). The ancestral nucleotide content was found to differ markedly across these three sets of sites, and the evolutionary processes operating at the variable sites were found to be non-SRH and best modeled by a combination of eight edge-specific rate matrices (four for V1 and four for V2). The number of substitutions per site at the variable sites also differed markedly, with sites belonging to V1 evolving slower than those belonging to V2 along the lineages separating the seven species of Saccharomyces. Finally, sites belonging to V1 appeared to have ceased evolving along the lineages separating S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus, implying that they might have become so selectively constrained that they could be considered invariable sites in these species.
Affiliation(s)
- Vivek Jayaswal
- School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia; School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia; CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia; and Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia
- Thomas K F Wong
- School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia; School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia; CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia; and Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia
- John Robinson
- School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia; School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia; CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia; and Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia
- Leon Poladian
- School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia; School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia; CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia; and Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia
- Lars S Jermiin
- School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia; School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia; CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia; and Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia
|
13
|
Hu X, Wong TKF, Lu ZJ, Chan TF, Lau TCK, Yiu SM, Yip KY. Computational identification of protein binding sites on RNAs using high-throughput RNA structure-probing data. Bioinformatics 2013; 30:1049-1055. [PMID: 24376038 DOI: 10.1093/bioinformatics/btt757] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 11/05/2013] [Accepted: 12/13/2013] [Indexed: 11/14/2022]
Abstract
MOTIVATION: High-throughput sequencing has been used to probe RNA structures, by treating RNAs with reagents that preferentially cleave or mark certain nucleotides according to their local structures, followed by sequencing of the resulting fragments. The data produced contain valuable information for studying various RNA properties. RESULTS: We developed methods for statistically modeling these structure-probing data and extracting structural features from them. We show that the extracted features can be used to predict RNA 'zipcodes' in yeast, regions bound by the She complex in asymmetric localization. The prediction accuracy was better than using raw RNA probing data or sequence features. We further demonstrate the use of the extracted features in identifying binding sites of RNA binding proteins from whole-transcriptome global photoactivatable-ribonucleoside-enhanced cross-linking and immunopurification (gPAR-CLIP) data. AVAILABILITY: The source code of our implemented methods is available at http://yiplab.cse.cuhk.edu.hk/probrna/ CONTACT: kevinyip@cse.cuhk.edu.hk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xihao Hu
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia, MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China 100084, School of Life Sciences, Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong and Department of Biology and Chemistry, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
- Thomas K F Wong
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia, MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China 100084, School of Life Sciences, Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong and Department of Biology and Chemistry, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
- Zhi John Lu
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia, MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China 100084, School of Life Sciences, Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong and Department of Biology and Chemistry, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
- Ting Fung Chan
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia, MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China 100084, School of Life Sciences, Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong and Department of Biology and Chemistry, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
- Terrence Chi Kong Lau
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia, MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China 100084, School of Life Sciences, Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong and Department of Biology and Chemistry, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
- Siu Ming Yiu
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia, MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China 100084, School of Life Sciences, Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong and Department of Biology and Chemistry, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
- Kevin Y Yip
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia, MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China 100084, School of Life Sciences, Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong and Department of Biology and Chemistry, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
|
14
|
Ma C, Wong TKF, Lam TW, Hon WK, Sadakane K, Yiu SM. An efficient alignment algorithm for searching simple pseudoknots over long genomic sequence. IEEE/ACM Trans Comput Biol Bioinform 2012; 9:1629-1638. [PMID: 22848134 DOI: 10.1109/tcbb.2012.104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Structural alignment has been shown to be an effective computational method for identifying structural noncoding RNA (ncRNA) candidates, as ncRNAs are known to be conserved in secondary structure. However, the complexity of structural alignment algorithms increases when the structure has pseudoknots. Even for the simplest type of pseudoknots (simple pseudoknots), the fastest algorithm runs in O(mn^3) time, where m and n are the length of the query ncRNA (with known structure) and the length of the target sequence (with unknown structure), respectively. In practice, we are usually given a long DNA sequence and try to locate regions in it that are possible candidates for a particular ncRNA. Thus, we need to run the structural alignment algorithm on every possible region in the long sequence. For example, finding candidates for a known ncRNA of length 100 on a sequence of length 50,000 takes more than one day. In this paper, we provide an efficient algorithm that solves the problem for simple pseudoknots and is shown to be 10 times faster. The speedup stems from an effective pruning strategy that computes a lower-bound score for the optimal alignment and estimates the maximum score a candidate can achieve in order to decide whether to prune the current candidate.
Affiliation(s)
- Christopher Ma
- Department of Computer Science, The University of Hong Kong, Rm 301, Chow Yei Ching Building, Pokfulam Road, Hong Kong.
|
15
|
Wong TKF, Chiu YS, Lam TW, Yiu SM. Memory efficient algorithms for structural alignment of RNAs with pseudoknots. IEEE/ACM Trans Comput Biol Bioinform 2012; 9:161-168. [PMID: 21464506 DOI: 10.1109/tcbb.2011.66] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2023]
Abstract
In this paper, we consider the problem of structural alignment of a target RNA sequence of length n and a query RNA sequence of length m with known secondary structure that may contain simple pseudoknots or embedded simple pseudoknots. The best known algorithm for solving this problem runs in O(mn^3) time for simple pseudoknots or O(mn^4) time for embedded simple pseudoknots, with space complexity of O(mn^3) for both structures; this requires too much memory, making it infeasible for comparing noncoding RNAs (ncRNAs) of length several hundred or more. We propose memory-efficient algorithms to solve the same problem. We reduce the space complexity to O(n^3) for simple pseudoknots and O(mn^2 + n^3) for embedded simple pseudoknots while maintaining the same time complexity. We also show how to modify our algorithm to handle a restricted class of recursive simple pseudoknots, which is found to be abundant in real data, with space complexity of O(mn^2 + n^3) and time complexity of O(mn^4). Experimental results show that our algorithms are feasible for comparing ncRNAs of length more than 500.
|
16
|
Abstract
MOTIVATION: Structural alignment of RNA has been found to be a useful computational technique for identifying non-coding RNAs (ncRNAs). However, existing tools do not handle structures with pseudoknots. Although algorithms exist that can handle structural alignment for different types of pseudoknots, no software tools are available, and users have to determine the type of pseudoknot in order to select the appropriate algorithm, which limits the use of structural alignment in identifying novel ncRNAs. RESULTS: We implemented the first web server, RNASAlign, which can automatically identify the pseudoknot type of a secondary structure and perform structural alignment of a folded RNA with every region of a target DNA/RNA sequence. Regions with high similarity scores and low e-values, together with the detailed alignments, are reported to the user. Experiments on more than 350 ncRNA families show that RNASAlign is effective. AVAILABILITY: http://www.bio8.cs.hku.hk/RNASAlign.
Affiliation(s)
- Thomas K F Wong
- Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong
|
17
|
Abstract
Background Orthologues are genes in different species that are related through divergent evolution from a common ancestor and are expected to have similar functions. Many databases have been created to describe orthologous genes based on existing sequence data. However, alternative splicing (in eukaryotes) is usually disregarded in the determination of orthologue groups and the functional consequences of alternative splicing have not been considered. Most multi-exon genes can encode multiple protein isoforms which often have different functions and can be disease-related. Extending the definition of orthologue groups to take account of alternate splicing and the functional differences it causes requires further examination. Results A subset of the orthologous gene groups between human and mouse was selected from the InParanoid database for this study. Each orthologue group was divided into sub-clusters, at the transcript level, using a method based on the sequence similarity of the isoforms. Transcript based sub-clusters were verified by functional signatures of the cluster members in the InterPro database. Functional similarity was higher within than between transcript-based sub-clusters of a defined orthologous group. In certain cases, cancer-related isoforms of a gene could be distinguished from other isoforms of the gene. Predictions of intrinsic disorder in protein regions were also correlated with the isoform sub-clusters within an orthologue group. Conclusions Sub-clustering of orthologue groups at the transcript level is an important step to more accurately define functionally equivalent orthologue groups. This work appears to be the first effort to refine orthologous groupings of genes based on the consequences of alternative splicing on function. Further investigation and refinement of the methodology to classify and verify isoform sub-clusters is needed, particularly to extend the technique to more distantly related species.
Affiliation(s)
- Yizhen Jia
- Department of Biochemistry, The University of Hong Kong, Hong Kong.
|
18
|
Abstract
Background Non-coding RNAs (ncRNAs) are known to be involved in many critical biological processes, and identification of ncRNAs is an important task in biological research. A popular software tool, Infernal, is the most successful prediction tool and exhibits high sensitivity. The application of Infernal has been mainly focused on small suspected regions. We tried to apply Infernal at the chromosome level; the results have high sensitivity, yet contain many false positives. Further enhancing Infernal for chromosome-level or genome-wide study is desirable. Methodology Based on the conjecture that adjacent nucleotide dependence affects the stability of the secondary structure of an ncRNA, we first conduct a systematic study on human ncRNAs and find that adjacent nucleotide dependence in human ncRNA should be useful for identifying ncRNAs. We then incorporate this dependence in the SCFG model and develop a new order-1 SCFG model for identifying ncRNAs. Conclusions In our experiments on human chromosomes, the proposed new model can eliminate more than 50% of the false positives reported by Infernal while maintaining the same sensitivity. The executable and the source code of the programs are freely available at http://i.cs.hku.hk/~kfwong/order1scfg.
Affiliation(s)
- Thomas K F Wong
- Department of Computer Science, The University of Hong Kong, Hong Kong, Special Administrative Region, People's Republic of China.
|
19
|
Wong TKF, Lam TW, Yiu SM, Wong SCK. Improving the accuracy of signal transduction pathway construction using Level-2 neighbours. Int J Bioinform Res Appl 2010; 6:542-555. [PMID: 21354961 DOI: 10.1504/ijbra.2010.038736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2023]
Abstract
In this paper, we consider the problem of reconstructing a pathway for a given set of proteins based on available genomics and proteomics information such as gene expression data. In all previous approaches, the scoring function for a candidate pathway usually only depends on adjacent proteins in the pathway. We propose to also consider proteins that are of distance two in the pathway (we call them Level-2 neighbours). We derive a scoring function based on both adjacent proteins and Level-2 neighbours in the pathway and show that our scoring function can increase the accuracy of the predicted pathways through a set of experiments. The problem of computing the pathway with optimal score, in general, is NP-hard. We thus extend a randomised algorithm to make it work on our scoring function to compute the optimal pathway with high probability.
Affiliation(s)
- Thomas K F Wong
- Faculty of Engineering, Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong.
|
20
|
Abstract
In the sequencing process, reads of the sequence are generated, then assembled to form contigs. New technologies can produce reads faster with lower cost and higher coverage. However, these reads are shorter. With errors, short reads make the assembly step more difficult. Chaisson et al. (2004) proposed an algorithm to correct the reads prior to the assembly step. The result is not satisfactory when the error rate is high (e.g., ≥3%). We improve their approach to handle reads of higher error rates. Experimental results show that our approach is much more effective in correcting errors, producing contigs of higher quality.
Affiliation(s)
- Thomas K F Wong
- Faculty of Engineering, Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong.
|
21
|
Wong TKF, Lam TW, Yang W, Yiu SM. Finding alternative splicing patterns with strong support from expressed sequences on individual exons/introns. J Bioinform Comput Biol 2009; 6:1021-33. [PMID: 18942164 DOI: 10.1142/s0219720008003825] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 10/10/2007] [Revised: 02/27/2008] [Accepted: 03/22/2008] [Indexed: 11/18/2022]
Abstract
We consider the problem of predicting alternative splicing patterns from a set of expressed sequences (cDNAs and ESTs). Some of these expressed sequences may be erroneous, thus forming incorrect exons/introns. These incorrect exons/introns may cause many false positives. For example, we examined a popular alternative splicing database, ECgene, which predicts alternative splicing patterns from expressed sequences. The result shows that about 81.3%-81.6% (sensitivity) of known patterns are found, but the specificity can be as low as 5.9%. Based on the idea that erroneous sequences are usually not consistent with other sequences, in this paper we provide an alternative approach for finding alternative splicing patterns which ensures that individual exons/introns of the reported patterns have enough support from the expressed sequences. On the same dataset, our approach can achieve a much higher specificity and a slight increase in sensitivity (38.9% and 84.9%, respectively). Our approach also gives better results compared with popular alternative splicing databases (ASD, ECgene, SpliceNest) and the software ClusterMerge.
Affiliation(s)
- Thomas K F Wong
- Department of Computer Science, The University of Hong Kong, Hong Kong.
|
22
|
Yang W, Ng P, Zhao M, Wong TKF, Yiu SM, Lau YL. Promoter-sharing by different genes in human genome--CPNE1 and RBM12 gene pair as an example. BMC Genomics 2008; 9:456. [PMID: 18831769 PMCID: PMC2568002 DOI: 10.1186/1471-2164-9-456] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 04/28/2008] [Accepted: 10/03/2008] [Indexed: 11/27/2022] Open
Abstract
Background Regulation of gene expression plays an important role in cellular functions. Co-regulation of different genes may indicate a functional connection or even physical interaction between gene products. Thus, analysis of genomic structures that may affect gene expression regulation could shed light on the functions of genes. Results In a whole-genome analysis of alternative splicing events, we found that two distinct genes, copine I (CPNE1) and RNA binding motif protein 12 (RBM12), share the most 5' exons and therefore the promoter region in human. Further analysis identified many gene pairs in the human genome that share the same promoters and 5' exons but have totally different coding sequences. Analysis of genomic and expressed sequences, either cDNAs or expressed sequence tags (ESTs), for CPNE1 and RBM12 confirmed the conservation of this phenomenon over the course of evolution. The co-expression of the two genes initiated from the same promoter was confirmed by Reverse Transcription-Polymerase Chain Reaction (RT-PCR) in different tissues in both human and mouse. High degrees of sequence conservation among multiple species in the 5'UTR region common to CPNE1 and RBM12 were also identified. Conclusion Promoter and 5'UTR sharing between CPNE1 and RBM12 is observed in human, mouse and zebrafish. Conservation of this genomic structure over the course of evolution indicates potential functional interaction between the two genes. More than 20 other gene pairs in the human genome were found to have a similar genomic structure in a genome-wide analysis, and it may represent a unique pattern of genomic arrangement that may affect expression regulation of the corresponding genes.
Affiliation(s)
- Wanling Yang
- Department of Paediatrics & Adolescent Medicine, LKS Faculty of Medicine, University of Hong Kong, Hong Kong, PR China.
|