Two clustering problems in analyzing next generation sequencing data
Date of Issue: 2016
School of Physical and Mathematical Sciences
As next-generation sequencing (NGS) becomes the dominant technology for studying gene expression profiles, downstream statistical analysis tools are urgently needed. Clustering samples is an important approach to revealing relationships among samples, for example in the discovery of new subtypes of cancer cells. When clustering high-dimensional data, it is also of interest to select the variables (genes) that are informative for clustering. This thesis presents a new penalized model-based method, PMixClus, that selects genes and performs clustering simultaneously. A negative binomial mixture model is developed for the nonnegative, discrete count data from RNA sequencing experiments. Moreover, the method automatically determines the number of clusters using the Bayesian information criterion. Additionally, in PMixClus, a hybrid hierarchical tree guided by the output of the model-based clustering can be used to visualize partial clustering structure hierarchically. Results on both simulated and real data demonstrate that the method performs better than or as well as competing methods.

DNA methylation is an important epigenetic modification that regulates gene transcription and plays a critical role in disease. Whole-genome bisulfite sequencing (WGBS) is an NGS technology for detecting genome-wide DNA methylation at single-CpG-site resolution. However, the high cost of such experiments and the complexity of the data challenge downstream analysis. We proposed a new tool, DMReSearch, to identify differentially methylated regions (DMRs) from WGBS data. We developed a three-dimensional rank method to pre-cluster the CpG sites, which considers CpG density, the distance between centers, and the fluctuation of differences between two biological groups.
Then we smoothed the methylation levels in each cluster with a modified local kernel smoother, carried out a statistical test at each CpG site using the beta-binomial distribution, and trimmed and merged the identified DMRs accordingly. We compared our method to BSmooth, the most popular method for detecting DMRs from WGBS data. In simulation experiments, DMReSearch yields better receiver operating characteristic curves. Real data experiments show that DMReSearch produces better smoothing results, reports fewer unreasonable DMRs, and gives consistent results between low- and high-coverage data sets.
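
As a rough illustration of what a beta-binomial test at a single CpG site can look like, the Python sketch below compares the methylated read counts of two groups with a likelihood ratio test. It is not the DMReSearch implementation. The function names, the fixed dispersion `rho`, the grid search over the mean methylation level, and the chi-square approximation with one degree of freedom are all assumptions made for illustration, and the smoothing step is omitted.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function, computed from log-gamma values."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(k, n, mu, rho):
    """Log-probability of k methylated reads out of n reads.

    mu is the mean methylation level in (0, 1); rho is a dispersion
    parameter in (0, 1), with rho = 1 / (a + b + 1) for Beta(a, b).
    """
    s = (1.0 - rho) / rho          # a + b
    a, b = mu * s, (1.0 - mu) * s  # Beta shape parameters
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def max_loglik(samples, rho, grid_size=199):
    """Maximize the log-likelihood over mu on a uniform grid (crude)."""
    best = -math.inf
    for i in range(1, grid_size + 1):
        mu = i / (grid_size + 1)
        ll = sum(betabinom_logpmf(k, n, mu, rho) for k, n in samples)
        best = max(best, ll)
    return best


def lrt_statistic(group1, group2, rho=0.05):
    """Likelihood ratio statistic: shared mean versus one mean per group.

    Each group is a list of (methylated_reads, total_reads) pairs, one
    pair per sample, all observed at the same CpG site. The dispersion
    rho is held fixed here, which is an assumption of this sketch.
    """
    ll_null = max_loglik(group1 + group2, rho)
    ll_alt = max_loglik(group1, rho) + max_loglik(group2, rho)
    return 2.0 * (ll_alt - ll_null)


def lrt_pvalue(stat):
    """Upper tail of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))


# Hypothetical read counts at one CpG site, three samples per group.
group_a = [(1, 20), (2, 25), (1, 18)]
group_b = [(18, 20), (22, 25), (15, 18)]
stat = lrt_statistic(group_a, group_b)
print(stat, lrt_pvalue(stat))
```

In practice the dispersion would be estimated from the data rather than fixed, and a numerical optimizer would replace the grid search; both simplifications keep the sketch dependent only on the standard library.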