1
Schröder M, Muller SH, Vradi E, Mielke J, Lim YM, Couvelard F, Mostert M, Koudstaal S, Eijkemans MJ, Gerlinger C. Sharing Medical Big Data While Preserving Patient Confidentiality in Innovative Medicines Initiative: A Summary and Case Report from BigData@Heart. BIG DATA 2023; 11:399-407. [PMID: 37889577 PMCID: PMC10733752 DOI: 10.1089/big.2022.0178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/29/2023]
Abstract
Sharing individual patient data (IPD) is a simple concept but complex to achieve due to data privacy and data security concerns, underdeveloped guidelines, and legal barriers. Sharing IPD is additionally difficult in big data-driven collaborations such as BigData@Heart in the Innovative Medicines Initiative, due to competing interests between diverse consortium members. One project within BigData@Heart, case study 1, needed to pool data from seven heterogeneous data sets: five randomized controlled trials from three different industry partners, and two disease registries. Sharing IPD was not considered feasible due to legal requirements and the sensitive medical nature of these data. In addition, harmonizing the data sets for a federated data analysis was difficult due to capacity constraints and the heterogeneity of the data sets. An alternative option was to share summary statistics through contingency tables. Here it is demonstrated that this method, along with anonymization methods to ensure patient anonymity, had minimal loss of information. Although sharing IPD should continue to be encouraged and strived for, our approach achieved a good balance between data transparency and patient privacy. It also allowed a successful collaboration between industry and academia.
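As a rough illustration of sharing summary statistics instead of IPD, the sketch below cross-tabulates two categorical fields and masks small cells. The abstract does not say which anonymization methods were used, so the small-cell suppression rule, the threshold of 5, the function name `contingency_table`, and the toy records are all assumptions made for this example.

```python
from collections import Counter


def contingency_table(records, row_key, col_key, min_cell=5):
    """Cross-tabulate two categorical fields and mask cells below min_cell."""
    counts = Counter((r[row_key], r[col_key]) for r in records)
    table = {}
    for (row, col), n in counts.items():
        # Assumed rule: very small cells could single out patients,
        # so they are reported as None (suppressed) instead of a count.
        table.setdefault(row, {})[col] = n if n >= min_cell else None
    return table


# Hypothetical toy records, not data from the study.
records = (
    [{"sex": "F", "event": "yes"}] * 12
    + [{"sex": "F", "event": "no"}] * 30
    + [{"sex": "M", "event": "yes"}] * 3
    + [{"sex": "M", "event": "no"}] * 25
)

table = contingency_table(records, "sex", "event")
print(table)
```

Only the table would leave the data holder's site; the row-level records stay where they are.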
Affiliation(s)
- Megan Schröder
- The Institute for Medical Information Processing, Biometry, and Epidemiology (IBE), Ludwig-Maximilians-Universität München, Munich, Germany
- Sam H.A. Muller
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Eleni Vradi
- Biomedical Data Science II, Bayer AG, Berlin, Germany
- Johanna Mielke
- Research and Early Development, Bayer AG, Wuppertal, Germany
- Yvonne M.F. Lim
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Institute for Clinical Research, National Institutes of Health, Selangor, Malaysia
- Fabrice Couvelard
- Institut de Recherches Internationales SERVIER (I.R.I.S.), Suresnes, France
- Menno Mostert
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Stefan Koudstaal
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Division of Heart and Lungs, Department of Cardiology, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Department of Cardiology, Groene Hart Ziekenhuis, Gouda, The Netherlands
- Marinus J.C. Eijkemans
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Christoph Gerlinger
- Clinical Statistics and Data Insights, Bayer AG, Berlin, Germany
- Department of Gynecology, Obstetrics and Reproductive Medicine, University Medical School of Saarland, Homburg/Saar, Germany
2
Overview of Federated Facility to Harmonize, Analyze and Management of Missing Data in Cohorts. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9194103] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 01/04/2023]
Abstract
Cohorts are instrumental for epidemiologically oriented observational studies. Cohort studies usually observe large groups of individuals for a specific period of time to identify the contributing factors to a specific outcome (for instance an illness) and create associations between risk factors and the outcome under study. In collaborative projects, federated data facilities are meta-database systems distributed across multiple locations that permit analyzing, combining, or harmonizing data from different sources, making them suitable for mega- and meta-analyses. The harmonization of data can increase the statistical power of studies through maximization of sample size, allowing for additional refined statistical analyses, which ultimately help answer research questions that could not be addressed using a single study. Indeed, harmonized data can be analyzed through mega-analysis of raw data or fixed-effects meta-analysis. Other types of data might be analyzed by, e.g., random-effects meta-analyses or Bayesian evidence synthesis. In this article, we describe some methodological aspects related to the construction of a federated facility to optimize analyses of multiple datasets, the impact of missing data, and some methods for handling missing data in cohort studies.
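To make the fixed-effects meta-analysis mentioned above concrete, here is a minimal sketch of standard inverse-variance pooling of per-study effect estimates. The function name `fixed_effect_meta` and the two example studies are hypothetical; the abstract does not describe the authors' own implementation.

```python
import math


def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance weighted pooled estimate and its standard error."""
    # Each study is weighted by the inverse of its variance (1 / SE^2).
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


# Hypothetical per-study effect estimates with equal precision.
estimates = [0.10, 0.30]
std_errors = [0.10, 0.10]

pooled, pooled_se = fixed_effect_meta(estimates, std_errors)
print(round(pooled, 3), round(pooled_se, 4))
```

Because only estimates and standard errors are exchanged, this kind of analysis can run without moving raw cohort data between sites.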
3
Teare HJA, de Masi F, Banasik K, Barnett A, Herrgard S, Jablonka B, Postma JWM, McDonald TJ, Forgie I, Chmura PJ, Rydzka EK, Gupta R, Brunak S, Pearson E, Kaye J. The governance structure for data access in the DIRECT consortium: an innovative medicines initiative (IMI) project. LIFE SCIENCES, SOCIETY AND POLICY 2018; 14:20. [PMID: 30182269 PMCID: PMC6123336 DOI: 10.1186/s40504-018-0083-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 03/14/2018] [Accepted: 07/23/2018] [Indexed: 06/08/2023]
Abstract
Biomedical research projects involving multiple partners from public and private sectors require coherent internal governance mechanisms to engender good working relationships. The DIRECT project is an example of such a venture, funded by the Innovative Medicines Initiative Joint Undertaking (IMI JU). This paper describes the data access policy that was developed within DIRECT to support data access and sharing, via the establishment of a 3-tiered Data Access Committee. The process was intended to allow quick access to data, whilst enabling strong oversight of how data were being accessed and by whom, and any subsequent analyses, to contribute to the overall objectives of the consortium.
Affiliation(s)
- Harriet J. A. Teare
- HeLEX Centre, University of Oxford, Ewert House, Banbury Road, Oxford, OX2 7DD UK
- Melbourne Law School, University of Melbourne, 185 Pelham Street, Carlton, VIC 3053 Australia
- Federico de Masi
- Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
- Karina Banasik
- Translational Disease Systems Biology, NNF Center for Protein Research, University of Copenhagen, Faculty of Health and Medical Sciences, Blegdamsvej 3B, DK-2200 Copenhagen, Denmark
- Anna Barnett
- Division of Molecular & Clinical Medicine, School of Medicine, University of Dundee, Ninewells Hospital & Medical School, Dundee, UK
- Sanna Herrgard
- Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
- Bernd Jablonka
- Sanofi-Aventis Deutschland GmbH, Industriepark Höchst, 65926 Frankfurt, Germany
- Jacqueline W. M. Postma
- Clinical Research Centre, Lund University Diabetes Centre, Box 50332, SE-202 13 Malmö, Sweden
- Timothy J. McDonald
- Blood Sciences, Template A2, Royal Devon and Exeter Hospital, Barrack Road, Exeter, EX2 5DW UK
- Ian Forgie
- Division of Molecular & Clinical Medicine, School of Medicine, University of Dundee, Ninewells Hospital & Medical School, Dundee, UK
- Piotr J. Chmura
- Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
- Emil K. Rydzka
- Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
- Ramneek Gupta
- Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
- Soren Brunak
- Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
- Translational Disease Systems Biology, NNF Center for Protein Research, University of Copenhagen, Faculty of Health and Medical Sciences, Blegdamsvej 3B, DK-2200 Copenhagen, Denmark
- Ewan Pearson
- Division of Molecular & Clinical Medicine, School of Medicine, University of Dundee, Ninewells Hospital & Medical School, Dundee, UK
- Jane Kaye
- HeLEX Centre, University of Oxford, Ewert House, Banbury Road, Oxford, OX2 7DD UK
- Melbourne Law School, University of Melbourne, 185 Pelham Street, Carlton, VIC 3053 Australia
4
Fortier I, Raina P, Van den Heuvel ER, Griffith LE, Craig C, Saliba M, Doiron D, Stolk RP, Knoppers BM, Ferretti V, Granda P, Burton P. Maelstrom Research guidelines for rigorous retrospective data harmonization. Int J Epidemiol 2017; 46:103-105. [PMID: 27272186 PMCID: PMC5407152 DOI: 10.1093/ije/dyw075] [Citation(s) in RCA: 85] [Impact Index Per Article: 12.1] [Accepted: 03/16/2016] [Indexed: 12/26/2022]
Abstract
Background It is widely accepted and acknowledged that data harmonization is crucial: in its absence, the co-analysis of major tranches of high quality extant data is liable to inefficiency or error. However, despite its widespread practice, no formalized/systematic guidelines exist to ensure high quality retrospective data harmonization. Methods To better understand real-world harmonization practices and facilitate development of formal guidelines, three interrelated initiatives were undertaken between 2006 and 2015. They included a phone survey with 34 major international research initiatives, a series of workshops with experts, and case studies applying the proposed guidelines. Results A wide range of projects use retrospective harmonization to support their research activities but even when appropriate approaches are used, the terminologies, procedures, technologies and methods adopted vary markedly. The generic guidelines outlined in this article delineate the essentials required and describe an interdependent step-by-step approach to harmonization: 0) define the research question, objectives and protocol; 1) assemble pre-existing knowledge and select studies; 2) define targeted variables and evaluate harmonization potential; 3) process data; 4) estimate quality of the harmonized dataset(s) generated; and 5) disseminate and preserve final harmonization products. Conclusions This manuscript provides guidelines aiming to encourage rigorous and effective approaches to harmonization which are comprehensively and transparently documented and straightforward to interpret and implement. This can be seen as a key step towards implementing guiding principles analogous to those that are well recognised as being essential in securing the foundational underpinning of systematic reviews and the meta-analysis of clinical trials.
Affiliation(s)
- Isabel Fortier
- Research Institute of the McGill University Health Centre, Montreal, QC, Canada
- Parminder Raina
- McMaster University, Department of Clinical Epidemiology and Biostatistics, Hamilton, ON, Canada
- Edwin R Van den Heuvel
- Eindhoven University of Technology, Department of Mathematics and Computer Science, Eindhoven, The Netherlands
- Lauren E Griffith
- McMaster University, Department of Clinical Epidemiology and Biostatistics, Hamilton, ON, Canada
- Camille Craig
- Research Institute of the McGill University Health Centre, Montreal, QC, Canada
- Matilda Saliba
- Research Institute of the McGill University Health Centre, Montreal, QC, Canada
- Dany Doiron
- Research Institute of the McGill University Health Centre, Montreal, QC, Canada
- Ronald P Stolk
- University Medical Center Groningen, Department of Epidemiology, Groningen, Groningen, The Netherlands
- Bartha M Knoppers
- McGill University, Centre of Genomics and Policy, Montreal, QC, Canada
- Vincent Ferretti
- Ontario Institute for Cancer Research, MaRS Centre, Toronto, ON, Canada
- Peter Granda
- University of Michigan, Inter-university Consortium for Political and Social Research (ICPSR), Ann Arbor, MI, USA
- Paul Burton
- University of Bristol, D2K Research Group, School of Social and Community Medicine, Bristol, UK
5
Park HS, Cho H, Kim HS. Development of an Integrated Biospecimen Database among the Regional Biobanks in Korea. Healthc Inform Res 2016; 22:129-41. [PMID: 27200223 PMCID: PMC4871843 DOI: 10.4258/hir.2016.22.2.129] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/22/2016] [Revised: 04/19/2016] [Accepted: 04/22/2016] [Indexed: 11/23/2022]
Abstract
Objectives This study developed an integrated database for 15 regional biobanks that provides large quantities of high-quality bio-data to researchers to be used for the prevention of disease, for the development of personalized medicines, and in genetics studies. Methods We collected raw data, managed independently by 15 regional biobanks, for database modeling and analyzed and defined the metadata of the items. We also built a three-step (high, middle, and low) classification system for classifying the item concepts based on the metadata. To generate clear meanings of the items, clinical items were defined using the Systematized Nomenclature of Medicine Clinical Terms, and specimen items were defined using the Logical Observation Identifiers Names and Codes. To optimize database performance, we set up a multi-column index based on the classification system and the international standard code. Results As a result of subdividing 7,197,252 raw data items collected, we refined the metadata into 1,796 clinical items and 1,792 specimen items. The classification system consists of 15 high, 163 middle, and 3,588 low class items. International standard codes were linked to 69.9% of the clinical items and 71.7% of the specimen items. The database consists of 18 tables based on a table from MySQL Server 5.6. As a result of the performance evaluation, the multi-column index shortened query time by as much as nine times. Conclusions The database developed was based on an international standard terminology system, providing an infrastructure that can integrate the 7,197,252 raw data items managed by the 15 regional biobanks. In particular, it resolved the inevitable interoperability issues in the exchange of information among the biobanks, and provided a solution to the synonym problem, which arises when the same concept is expressed in a variety of ways.
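The multi-column index described in this abstract can be sketched as follows. The study used MySQL Server 5.6; this example swaps in Python's built-in `sqlite3` module so that it runs without a server, and the table name `item`, the column names, the index name, and the placeholder codes are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE item (
           id INTEGER PRIMARY KEY,
           high_class TEXT,
           middle_class TEXT,
           low_class TEXT,
           standard_code TEXT
       )"""
)
# One index over the three classification levels plus the standard code,
# ordered from the coarsest level to the finest.
conn.execute(
    "CREATE INDEX idx_item_class_code "
    "ON item (high_class, middle_class, low_class, standard_code)"
)
conn.executemany(
    "INSERT INTO item (high_class, middle_class, low_class, standard_code) "
    "VALUES (?, ?, ?, ?)",
    [
        ("clinical", "diagnosis", "primary_diagnosis", "CODE-001"),
        ("clinical", "diagnosis", "secondary_diagnosis", "CODE-002"),
        ("specimen", "blood", "serum", "CODE-003"),
    ],
)

query = (
    "SELECT low_class FROM item "
    "WHERE high_class = ? AND middle_class = ? ORDER BY low_class"
)
params = ("clinical", "diagnosis")
matches = conn.execute(query, params).fetchall()

# The last column of each EXPLAIN QUERY PLAN row is a text description
# of the access path, which names the index when it is used.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
uses_index = any("idx_item_class_code" in row[-1] for row in plan)
print(matches, uses_index)
```

A query that filters on the leading columns of the index can be answered from the index alone, which is the general mechanism behind the kind of speed-up the abstract reports.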
Affiliation(s)
- Hyun Sang Park
- Department of Medical Informatics, Kyungpook National University, Daegu, Korea
- Hune Cho
- Department of Medical Informatics, Kyungpook National University, Daegu, Korea
- Hwa Sun Kim
- Faculty of Medical Industry Convergence, Daegu Haany University, Gyeongsan, Korea
6
Carter KW, Francis RW, Bresnahan M, Gissler M, Grønborg TK, Gross R, Gunnes N, Hammond G, Hornig M, Hultman CM, Huttunen J, Langridge A, Leonard H, Newman S, Parner ET, Petersson G, Reichenberg A, Sandin S, Schendel DE, Schalkwyk L, Sourander A, Steadman C, Stoltenberg C, Suominen A, Surén P, Susser E, Sylvester Vethanayagam A, Yusof Z. ViPAR: a software platform for the Virtual Pooling and Analysis of Research Data. Int J Epidemiol 2015; 45:408-416. [PMID: 26452388 PMCID: PMC4864874 DOI: 10.1093/ije/dyv193] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Indexed: 11/23/2022]
Abstract
Background:
Research studies exploring the determinants of disease require sufficient statistical power to detect meaningful effects. Sample size is often increased through centralized pooling of disparately located datasets, though ethical, privacy and data ownership issues can often hamper this process. Methods that facilitate the sharing of research data that are sympathetic with these issues and which allow flexible and detailed statistical analyses are therefore in critical need. We have created a software platform for the Virtual Pooling and Analysis of Research data (ViPAR), which employs free and open source methods to provide researchers with a web-based platform to analyse datasets housed in disparate locations.
Methods:
Database federation permits controlled access to remotely located datasets from a central location. The Secure Shell protocol allows data to be securely exchanged between devices over an insecure network. ViPAR combines these free technologies into a solution that facilitates ‘virtual pooling’ where data can be temporarily pooled into computer memory and made available for analysis without the need for permanent central storage.
Results:
Within the ViPAR infrastructure, remote sites manage their own harmonized research dataset in a database hosted at their site, while a central server hosts the data federation component and a secure analysis portal. When an analysis is initiated, requested data are retrieved from each remote site and virtually pooled at the central site. The data are then analysed by statistical software and, on completion, results of the analysis are returned to the user and the virtually pooled data are removed from memory.
Conclusions:
ViPAR is a secure, flexible and powerful analysis platform built on open source technology that is currently in use by large international consortia, and is made publicly available at [http://bioinformatics.childhealthresearch.org.au/software/vipar/].
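The "virtual pooling" workflow in the Results can be pictured with the simplified sketch below: data are fetched from each site, held only in memory for one analysis, and then discarded. It is not ViPAR's code. The in-process `SITES` dictionary stands in for remote site databases reached through database federation over Secure Shell, and every name in it is hypothetical.

```python
from statistics import mean

# Hypothetical stand-ins for harmonized datasets hosted at remote sites.
SITES = {
    "site_a": [{"age": 34, "case": 1}, {"age": 41, "case": 0}],
    "site_b": [{"age": 29, "case": 0}, {"age": 52, "case": 1}],
}


def fetch_from_site(name, columns):
    """Return only the requested columns from one site's dataset."""
    return [{c: row[c] for c in columns} for row in SITES[name]]


def virtual_pool_analysis(site_names, columns, analyse):
    """Pool the requested data in memory, analyse it, then drop the pool."""
    pooled = []
    try:
        for name in site_names:
            pooled.extend(fetch_from_site(name, columns))
        return analyse(pooled)
    finally:
        # The pooled rows exist only for the duration of the analysis.
        pooled.clear()


result = virtual_pool_analysis(
    SITES, ["age"], lambda rows: mean(r["age"] for r in rows)
)
print(result)
```

Only the analysis result is returned to the caller; the pooled rows are never written to permanent central storage.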
Affiliation(s)
- K W Carter
- Telethon Kids Institute, University of Western Australia, Perth, WA, Australia
- R W Francis
- Telethon Kids Institute, University of Western Australia, Perth, WA, Australia
- M Bresnahan
- Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY, USA, New York State Psychiatric Institute, New York, NY, USA
- M Gissler
- National Institute for Health and Welfare, Helsinki, Finland, NHV Nordic School of Public Health, Gothenburg, Sweden
- T K Grønborg
- Department of Public Health, University of Aarhus, Aarhus, Denmark
- R Gross
- Division of Psychiatry, Sheba Medical Center, Tel Hashomer, Israel, Department of Epidemiology and Preventive Medicine, Sackler Faculty of Medicine, Tel Aviv University, Ramat Aviv, Israel
- N Gunnes
- Norwegian Institute of Public Health, Oslo, Norway
- G Hammond
- Telethon Kids Institute, University of Western Australia, Perth, WA, Australia
- M Hornig
- Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY, USA, Center for Infection and Immunity, Mailman School of Public Health, Columbia University, New York, NY, USA
- A Langridge
- Telethon Kids Institute, University of Western Australia, Perth, WA, Australia
- H Leonard
- Telethon Kids Institute, University of Western Australia, Perth, WA, Australia
- S Newman
- Institute of Psychiatry, King's College London, London, UK
- E T Parner
- Department of Public Health, University of Aarhus, Aarhus, Denmark
- A Reichenberg
- Department of Psychosis Studies, Institute of Psychiatry, King's College London, London, UK, Departments of Preventative Medicine and Psychiatry, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- S Sandin
- Karolinska Institutet, Stockholm, Sweden
- D E Schendel
- Department of Public Health, Section for Epidemiology, University of Aarhus, Aarhus, Denmark, Department of Economics and Business, National Centre for Register-based Research, University of Aarhus, Aarhus, Denmark, Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, Copenhagen, Denmark
- L Schalkwyk
- Institute of Psychiatry, King's College London, London, UK
- A Sourander
- Child Psychiatry Research Center, Department of Child Psychiatry, Turku University, Turku, Finland, Turku University Hospital, Turku, Finland
- C Steadman
- Telethon Kids Institute, University of Western Australia, Perth, WA, Australia
- C Stoltenberg
- Norwegian Institute of Public Health, Oslo, Norway, Department of Global Public Health and Primary Care, University of Bergen, Bergen, Norway
- A Suominen
- Department of Child Psychiatry, Turku University, Turku, Finland
- P Surén
- Norwegian Institute of Public Health, Oslo, Norway
- E Susser
- Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY, USA, New York State Psychiatric Institute, New York, NY, USA
- Z Yusof
- Karolinska Institutet, Stockholm, Sweden
7
Data harmonization and federated analysis of population-based studies: the BioSHaRE project. Emerg Themes Epidemiol 2013; 10:12. [PMID: 24257327 PMCID: PMC4175511 DOI: 10.1186/1742-7622-10-12] [Citation(s) in RCA: 90] [Impact Index Per Article: 8.2] [Received: 07/03/2013] [Accepted: 11/11/2013] [Indexed: 01/08/2023]
8
Schendel DE, Bresnahan M, Carter KW, Francis RW, Gissler M, Grønborg TK, Gross R, Gunnes N, Hornig M, Hultman CM, Langridge A, Lauritsen MB, Leonard H, Parner ET, Reichenberg A, Sandin S, Sourander A, Stoltenberg C, Suominen A, Surén P, Susser E. The International Collaboration for Autism Registry Epidemiology (iCARE): multinational registry-based investigations of autism risk factors and trends. J Autism Dev Disord 2013; 43:2650-63. [PMID: 23563868 PMCID: PMC4512211 DOI: 10.1007/s10803-013-1815-x] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 10/27/2022]
Abstract
The International Collaboration for Autism Registry Epidemiology (iCARE) is the first multinational research consortium (Australia, Denmark, Finland, Israel, Norway, Sweden, USA) to promote research in autism geographical and temporal heterogeneity, phenotype, family and life course patterns, and etiology. iCARE devised solutions to challenges in multinational collaboration concerning data access security, confidentiality and management. Data are obtained by integrating existing national or state-wide, population-based, individual-level data systems and undergo rigorous harmonization and quality control processes. Analyses are performed using database federation via a computational infrastructure with a secure, web-based, interface. iCARE provides a unique, unprecedented resource in autism research that will significantly enhance the ability to detect environmental and genetic contributions to the causes and life course of autism.
Affiliation(s)
- Diana E Schendel
- Department of Public Health and Department of Economics and Business, University of Aarhus, 8000, Aarhus C, Denmark
9
Linkage of Data from Diverse Data Sources (LDS): A Data Combination Model Provides Clinical Data of Corresponding Specimens in Biobanking Information System. J Med Syst 2013; 37:9975. [DOI: 10.1007/s10916-013-9975-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 05/30/2013] [Accepted: 08/29/2013] [Indexed: 11/26/2022]
10
Boomsma DI, Willemsen G, Vink JM, Bartels M, Groot P, Hottenga JJ, van Beijsterveldt CEMT, Stroet T, van Dijk R, Wertheim R, Visser M, van der Kleij F. Design and Implementation of a Twin-Family Database for Behavior Genetics and Genomics Studies. Twin Res Hum Genet 2012; 11:342-8. [DOI: 10.1375/twin.11.3.342] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 11/05/2022]
Abstract
In this article we describe the design and implementation of a database for extended twin families. The database does not focus on probands or on index twins, as this approach becomes problematic when larger multigenerational families are included, when more than one set of multiples is present within a family, or when families turn out to be part of a larger pedigree. Instead, we present an alternative approach that uses a highly flexible notion of persons and relations. The relations among the subjects in the database have a one-to-many structure, are user-definable and extendible and support arbitrarily complicated pedigrees. Some additional characteristics of the database are highlighted, such as the storage of historical data, predefined expressions for advanced queries, output facilities for individuals and relations among individuals and an easy-to-use multi-step wizard for contacting participants. This solution presents a flexible approach to accommodate pedigrees of arbitrary size, multiple biological and nonbiological relationships among participants and dynamic changes in these relations that occur over time, which can be implemented for any type of multigenerational family study.
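One way to picture the "persons and relations" idea is the sketch below, in which each person can hold many typed relations and new relation types need no schema change. The class names, fields, and relation labels are hypothetical and are not taken from the database described in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    person_id: int
    name: str


@dataclass
class Pedigree:
    persons: dict = field(default_factory=dict)
    # Each relation is a (from_id, relation_type, to_id) triple, so one
    # person can have any number of relations of any user-defined type.
    relations: list = field(default_factory=list)

    def add_person(self, person_id, name):
        self.persons[person_id] = Person(person_id, name)

    def relate(self, from_id, relation_type, to_id):
        self.relations.append((from_id, relation_type, to_id))

    def related(self, person_id, relation_type):
        """Ids of everyone linked from person_id by relation_type."""
        return [
            to_id
            for from_id, rel, to_id in self.relations
            if from_id == person_id and rel == relation_type
        ]


pedigree = Pedigree()
for pid, name in [(1, "mother"), (2, "twin_a"), (3, "twin_b"), (4, "adoptee")]:
    pedigree.add_person(pid, name)

pedigree.relate(1, "biological_parent_of", 2)
pedigree.relate(1, "biological_parent_of", 3)
pedigree.relate(1, "adoptive_parent_of", 4)  # nonbiological relation type
pedigree.relate(2, "co_twin_of", 3)

print(pedigree.related(1, "biological_parent_of"))
```

Because no person is treated as the proband, a family that turns out to belong to a larger pedigree only needs additional relation triples.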
11
Wichmann HE, Kuhn KA, Waldenberger M, Schmelcher D, Schuffenhauer S, Meitinger T, Wurst SHR, Lamla G, Fortier I, Burton PR, Peltonen L, Perola M, Metspalu A, Riegman P, Landegren U, Taussig MJ, Litton JE, Fransson MN, Eder J, Cambon-Thomsen A, Bovenberg J, Dagher G, van Ommen GJ, Griffith M, Yuille M, Zatloukal K. Comprehensive catalog of European biobanks. Nat Biotechnol 2011; 29:795-7. [DOI: 10.1038/nbt.1958] [Citation(s) in RCA: 69] [Impact Index Per Article: 5.3] [Indexed: 11/09/2022]
12
13
Späth MB, Grimson J. Applying the archetype approach to the database of a biobank information management system. Int J Med Inform 2010; 80:205-26. [PMID: 21131230 DOI: 10.1016/j.ijmedinf.2010.11.002] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.6] [Received: 07/26/2010] [Revised: 11/01/2010] [Accepted: 11/02/2010] [Indexed: 11/17/2022]
Abstract
PURPOSE The purpose of this study is to investigate the feasibility of applying the openEHR archetype approach to modelling the data in the database of an existing proprietary biobank information management system. A biobank information management system stores the clinical/phenotypic data of the sample donor and sample related information. The clinical/phenotypic data is potentially sourced from the donor's electronic health record (EHR). The study evaluates the reuse of openEHR archetypes that have been developed for the creation of an interoperable EHR in the context of biobanking, and proposes a new set of archetypes specifically for biobanks. The ultimate goal of the research is the development of an interoperable electronic biomedical research record (eBMRR) to support biomedical knowledge discovery. METHODS The database of the prostate cancer biobank of the Irish Prostate Cancer Research Consortium (PCRC), which supports the identification of novel biomarkers for prostate cancer, was taken as the basis for the modelling effort. First the database schema of the biobank was analyzed and reorganized into archetype-friendly concepts. Then, archetype repositories were searched for matching archetypes. Some existing archetypes were reused without change, some were modified or specialized, and new archetypes were developed where needed. The fields of the biobank database schema were then mapped to the elements in the archetypes. Finally, the archetypes were arranged into templates specifically to meet the requirements of the PCRC biobank. RESULTS A set of 47 archetypes was found to cover all the concepts used in the biobank. Of these, 29 (62%) were reused without change, 6 were modified and/or extended, 1 was specialized, and 11 were newly defined. These archetypes were arranged into 8 templates specifically required for this biobank. A number of issues were encountered in this research. 
Some arose from the immaturity of the archetype approach, such as immature modelling support tools, difficulties in defining high-quality archetypes and the problem of overlapping archetypes. In addition, the identification of suitable existing archetypes was time-consuming and many semantic conflicts were encountered during the process of mapping the PCRC BIMS database to existing archetypes. These include differences in the granularity of documentation, in metadata-level versus data-level modelling, in terminologies and vocabularies used, and in the amount of structure imposed on the information to be recorded. Furthermore, the current way of modelling the sample entity was found to be cumbersome in the sample-centric activity of biobanking. CONCLUSIONS The archetype approach is a promising approach to create a shareable eBMRR based on the study participant/donor for biobanks. Many archetypes originally developed for the EHR domain can be reused to model the clinical/phenotypic and sample information in the biobank context, which validates the genericity of these archetypes and their potential for reuse in the context of biomedical research. However, finding suitable archetypes in the repositories and establishing an exact mapping between the fields in the PCRC BIMS database and the elements of existing archetypes that have been designed for clinical practice can be challenging and time-consuming and involves resolving many common system integration conflicts. These may be attributable to differences in the requirements for information documentation between clinical practice and biobanking. This research also recognized the need for better support tools, modelling guidelines and best practice rules and reconfirmed the need for better domain knowledge governance. Furthermore, the authors propose that the establishment of an independent sample record with the sample as record subject should be investigated. 
The research presented in this paper is limited by the fact that the new archetypes developed during this research are based on a single biobank instance. These new archetypes may not be complete, representing only those subsets of items required by this particular database. Nevertheless, this exercise exposes some of the gaps that exist in the archetype modelling landscape and highlights the concepts that need to be modelled with archetypes to enable the development of an eBMRR.
Affiliation(s)
- Melanie Bettina Späth
- Centre for Health Informatics, School of Computer Science and Statistics, Trinity College Dublin, Dublin 2, Ireland.
|
14
|
Kim H, Yi BK, Kim IK, Kwak YS. Integrating Clinical Information in National Biobank of Korea. J Med Syst 2009; 35:647-56. [DOI: 10.1007/s10916-009-9402-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 08/13/2009] [Accepted: 11/16/2009] [Indexed: 02/04/2023]
|
15
|
Baker EJ, Jay JJ, Philip VM, Zhang Y, Li Z, Kirova R, Langston MA, Chesler EJ. Ontological Discovery Environment: a system for integrating gene-phenotype associations. Genomics 2009; 94:377-87. [PMID: 19733230 DOI: 10.1016/j.ygeno.2009.08.016] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 06/03/2009] [Revised: 08/19/2009] [Accepted: 08/27/2009] [Indexed: 10/20/2022]
Abstract
The wealth of genomic technologies has enabled biologists to rapidly ascribe phenotypic characters to biological substrates. Central to effective biological investigation is the operational definition of the process under investigation. We propose an elucidation of categories of biological characters, including disease relevant traits, based on natural endogenous processes and experimentally observed biological networks, pathways and systems rather than on externally manifested constructs and current semantics such as disease names and processes. The Ontological Discovery Environment (ODE) is an Internet accessible resource for the storage, sharing, retrieval and analysis of phenotype-centered genomic data sets across species and experimental model systems. Any type of data set representing gene-phenotype relationships, such as quantitative trait loci (QTL) positional candidates, literature reviews, microarray experiments, ontological or even meta-data, may serve as inputs. To demonstrate a use case leveraging the homology capabilities of ODE and its ability to synthesize diverse data sets, we conducted an analysis of genomic studies related to alcoholism. The core of ODE's gene set similarity, distance and hierarchical analysis is the creation of a bipartite network of gene-phenotype relations, a unique discrete graph approach to analysis that enables set-set matching of non-referential data. Gene sets are annotated with several levels of metadata, including community ontologies, while gene set translations compare models across species. Computationally derived gene sets are integrated into hierarchical trees based on gene-derived phenotype interdependencies. Automated set identifications are augmented by statistical tools which enable users to interpret the confidence of modeled results.
This approach allows data integration and hypothesis discovery across multiple experimental contexts, regardless of the face similarity and semantic annotation of the experimental systems or species domain.
Affiliation(s)
- Erich J Baker
- Department of Computer Science, Baylor University, Waco, TX, USA
|
16
|
Information Systems for Federated Biobanks. TRANSACTIONS ON LARGE-SCALE DATA- AND KNOWLEDGE-CENTERED SYSTEMS I 2009. [DOI: 10.1007/978-3-642-03722-1_7] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 12/13/2022]
|
17
|
|
18
|
Abstract
Efforts are underway to define a national framework for secondary analysis of health-related data. In the meantime, regional health databases have been constructed using insurance claims data, clinical data from single large health care providers, clinical data from multiple collaborating health care providers, and public health data. Large-scale survey data also are available in government databases. Clinical laboratory results are an important component of all these databases because they can provide validation for manually assigned diagnostic and procedure codes and can support inference of key information not provided by coding, such as severity of disease and prevalence of risk factors.
Affiliation(s)
- James H Harrison
- Department of Public Health Sciences, University of Virginia, Suite 3181 West Complex, 1335 Hospital Drive, Charlottesville, VA 22908, USA.
|
19
|
Abstract
Biomedical data useful for data mining are often distributed across multiple databases. These databases may be aggregated using several techniques to create single data sets that may be mined using standard approaches; however, separate databases may, in their design or data representation, capture information that is analytically useful and that is lost on integration. Recent techniques for mining multiple databases simultaneously but separately may preserve and leverage the unique perspectives within each database. This article presents an example, "dual mining," in which concurrent analysis of a target database with a related knowledge base can improve the identification of association patterns in the target most likely to be of interest for further analysis.
Affiliation(s)
- Mir S Siadaty
- Division of Clinical Informatics, Department of Public Health Sciences, University of Virginia, Suite 3181 West Complex, 1335 Hospital Drive, Charlottesville, VA 22908, USA.
|
20
|
Perola M, Sammalisto S, Hiekkalinna T, Martin NG, Visscher PM, Montgomery GW, Benyamin B, Harris JR, Boomsma D, Willemsen G, Hottenga JJ, Christensen K, Kyvik KO, Sørensen TIA, Pedersen NL, Magnusson PKE, Spector TD, Widen E, Silventoinen K, Kaprio J, Palotie A, Peltonen L. Combined genome scans for body stature in 6,602 European twins: evidence for common Caucasian loci. PLoS Genet 2007; 3:e97. [PMID: 17559308 PMCID: PMC1892350 DOI: 10.1371/journal.pgen.0030097] [Citation(s) in RCA: 132] [Impact Index Per Article: 7.8] [Received: 02/04/2007] [Accepted: 05/02/2007] [Indexed: 01/06/2023] Open
Abstract
Twin cohorts provide a unique advantage for investigations of the role of genetics and environment in the etiology of variation in common complex traits by reducing the variance due to environment, age, and cohort differences. The GenomEUtwin (http://www.genomeutwin.org) consortium consists of eight twin cohorts (Australian, Danish, Dutch, Finnish, Italian, Norwegian, Swedish, and United Kingdom) with the total resource of hundreds of thousands of twin pairs. We performed quantitative trait locus (QTL) analysis of one of the most heritable human complex traits, adult stature (body height), using genome-wide scans performed for 3,817 families (8,450 individuals) derived from twin cohorts from Australia, Denmark, Finland, Netherlands, Sweden, and United Kingdom with an approximate ten-centimorgan microsatellite marker map. The marker maps for different studies differed and they were combined and related to the sequence positions using software developed by us, which is publicly available (https://apps.bioinfo.helsinki.fi/software/cartographer.aspx). Variance component linkage analysis was performed with age, sex, and country of origin as covariates. The covariate adjusted heritability was 81% for stature in the pooled dataset. We found evidence for a major QTL for human stature on 8q21.3 (multipoint logarithm of the odds 3.28), and suggestive evidence for loci on Chromosomes X, 7, and 20. Some evidence of sex heterogeneity was found; however, no obvious female-specific QTLs emerged. Several cohorts contributed to the identified loci, suggesting an evolutionarily old genetic variant having effects on stature in European-based populations. To facilitate the genetic studies of stature we have also set up a website that lists all stature genome scans published and their most significant loci (http://www.genomeutwin.org/stature_gene_map.htm).
Twin cohorts provide a unique advantage for research of the role of genetics and environment behind common complex traits by reducing the variance due to environment, age, and cohort differences. The GenomEUtwin consortium consists of eight twin cohorts with the total resource of hundreds of thousands of twin pairs (http://www.genomeutwin.org). We performed quantitative family-based genetic linkage analysis for one of the most heritable human complex traits, adult stature (body height), using genome-wide scans derived from twin cohorts from Australia, Denmark, Finland, Netherlands, Sweden, and United Kingdom. Age, sex, and country were adjusted for in the data analyses. Human stature was found to be very heritable across all the cohorts and in the combined dataset. We found evidence for a shared genetic locus accounting for human stature on Chromosome 8, and suggestive evidence for loci on Chromosomes X, 7, and 20. Since twins from several countries contributed to the identified loci, an evolutionarily old genetic variant must influence stature in European-based populations. To facilitate the research in the field we have also set up a website that lists all stature genome scans published and their most significant loci (http://www.genomeutwin.org/stature_gene_map.htm).
Affiliation(s)
- Markus Perola
- Department of Molecular Medicine, National Public Health Institute, Helsinki, Finland
- Faculty of Medicine, Department of Medical Genetics, University of Helsinki, Helsinki, Finland
- Sampo Sammalisto
- Department of Molecular Medicine, National Public Health Institute, Helsinki, Finland
- Tero Hiekkalinna
- Department of Molecular Medicine, National Public Health Institute, Helsinki, Finland
- Nick G Martin
- Queensland Institute of Medical Research, Brisbane, Australia
- Beben Benyamin
- Queensland Institute of Medical Research, Brisbane, Australia
- Kaare Christensen
- Department of Epidemiology, Institute of Public Health, University of Southern Denmark, Odense, Denmark
- Kirsten Ohm Kyvik
- Department of Epidemiology, Institute of Public Health, University of Southern Denmark, Odense, Denmark
- Elisabeth Widen
- Finnish Genome Center, University of Helsinki, Helsinki, Finland
- Karri Silventoinen
- Faculty of Medicine, Department of Public Health, University of Helsinki, Helsinki, Finland
- Jaakko Kaprio
- Faculty of Medicine, Department of Public Health, University of Helsinki, Helsinki, Finland
- Department of Mental Health and Alcohol Research, National Public Health Institute, Helsinki, Finland
- Aarno Palotie
- Finnish Genome Center, University of Helsinki, Helsinki, Finland
- Leena Peltonen
- Department of Molecular Medicine, National Public Health Institute, Helsinki, Finland
- Faculty of Medicine, Department of Medical Genetics, University of Helsinki, Helsinki, Finland
- The Broad Institute, Massachusetts Institute of Technology, Boston, Massachusetts, United States of America
- * To whom correspondence should be addressed. E-mail:
|