Processing and analysis of DNA methylation and statistical integration with genetic and clinical data
Date of Issue: 2017-04-05
School of Computer Science and Engineering
Epigenetics is genetic regulation that is not directly encoded in the DNA sequence. DNA methylation is an important epigenetic mark, and its levels at particular loci have been associated with health and disease states. A better understanding of DNA methylation may lead to the development of improved therapeutics for these diseases. Recent technical advances such as high-throughput arrays and next-generation sequencing have enabled the discovery of epigenetic correlates of clinical phenotypes across hundreds of clinical samples, genome-wide and at single-base resolution. Epigenome-wide data on large numbers of samples call for efficient computational analysis of well-designed experiments, new data-analytic pipelines, mathematical models and biological interpretation. This thesis addresses three major challenges. First, reliable analysis requires optimizing the detection of epigenetic variation and minimizing the impact of technical artefacts. Second, algorithms must be very efficient, because multivariate statistical models are essential in association studies and demand high computational power when applied genome-wide. Third, computational methods should facilitate biological discovery through large-scale data analyses. Corresponding to these challenges, this thesis presents three major contributions. First, a new pipeline has been developed to process genome-wide DNA methylation data generated by the most commonly used platform, the Illumina Infinium HumanMethylation 450K array, and to remove technical artefacts from it. Compared with pre-existing algorithms, data processed through this pipeline were in better agreement with data obtained from reduced representation bisulfite sequencing on the same clinical samples. This study was further extended to evaluate an emerging next-generation sequencing technology, Methyl Capture Sequencing, in buccal clinical samples.
This thesis provides a comprehensive comparison of array and sequencing data across key functional genomic regions, in terms of their coverage, the concordance of their methylation calls and their use in epigenome-wide association studies. The second part presents a suite of statistical models developed to study the complicated relationships between Gene, Environment and Methylation. Three major functions, GEM_Emodel, GEM_Gmodel and GEM_GxEmodel, have been developed into an R package named GEM. Using matrix-based iterative correlation and memory-efficient data analysis, GEM reliably tests millions of associations between DNA methylation, genetic variants and environmental factors within minutes in a standard computational setting. GEM has been validated by comprehensive benchmarking and is now part of Bioconductor, an extensively used open-source bioinformatics suite. Lastly, GEM was employed to study DNA methylation and its integration with genetic variants and environmental influences in multi-ethnic Asian neonates from a Singapore-based birth cohort (Growing Up in Singapore Towards healthy Outcomes, GUSTO), and to discover methylation changes associated with sub-optimal health outcomes in early life. In an analysis of 237 GUSTO neonatal methylomes, methylation quantitative trait loci were readily detected, and 75% of the most variably methylated regions were best explained by the interaction of genotype with in utero environments. This study sheds new light on the complex relationship between biological inheritance and individual prenatal experience, and suggests the importance of considering both genetic variation and environmental factors when interpreting epigenetic variation. The GEM models were also applied to show that HIF3A DNA methylation, measured in the umbilical cord of 991 newborns, can aid understanding of the genesis of adiposity at birth.
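The idea of matrix-based correlation mentioned above can be illustrated with a minimal sketch. It is not the GEM implementation (GEM is an R package and its internals are not described here). It only assumes that each methylation site and each environmental factor is a row of measurements over the same samples, with no covariates. All function names and toy values below are hypothetical. After every row is centred and scaled to unit length, the Pearson correlation of any pair reduces to a dot product, so all pairs can be tested in one matrix-style pass.

```python
import math


def standardize_rows(rows):
    """Centre each row and scale it to unit Euclidean norm."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        centered = [v - mean for v in row]
        norm = math.sqrt(sum(v * v for v in centered))
        if norm == 0:
            raise ValueError("a constant row has no defined correlation")
        out.append([v / norm for v in centered])
    return out


def all_pairs_correlation(meth, covar):
    """Pearson r for every (methylation row, covariate row) pair.

    After standardisation, r is simply the dot product of the two rows.
    """
    m = standardize_rows(meth)
    c = standardize_rows(covar)
    return [[sum(a * b for a, b in zip(mrow, crow)) for crow in c] for mrow in m]


def t_statistic(r, n):
    """t statistic for testing r = 0 with n samples (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical toy data: 2 methylation sites and 1 exposure over 5 samples.
meth = [
    [0.10, 0.20, 0.30, 0.40, 0.50],  # hypothetical CpG site 1
    [0.80, 0.70, 0.75, 0.72, 0.78],  # hypothetical CpG site 2
]
env = [[1.0, 2.0, 3.0, 4.0, 6.0]]  # hypothetical environmental factor

r = all_pairs_correlation(meth, env)
t = [[t_statistic(value, 5) for value in row] for row in r]
```

In practice the dot products would be computed with optimised matrix routines and processed in blocks to bound memory use. The nested loops here only make the arithmetic explicit.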