Effective graph-based algorithms for weak motif discovery in genomic sequences
Date of Issue: 2014
School of Computer Engineering
This thesis aims to improve weak motif discovery in genomic sequences. The task is significant and urgent because motifs provide the basis for biologists to derive knowledge about gene functions. This knowledge could reveal the mechanisms of diseases and lead to novel molecular targets for therapeutic drugs. However, because of their prohibitive cost, traditional wet-lab techniques are no longer adequate for large-scale data, and computational approaches can provide valuable help. Computational discovery of weak motifs nevertheless remains challenging. Because many false instances of a degenerate motif can easily disguise the true ones, the performance of existing algorithms for this problem is, despite intensive research, far from satisfactory. Approximate algorithms based on Expectation Maximization or Gibbs Sampling can miss true instances; exact algorithms based on clique finding in graphs or on generating and validating patterns (candidate motifs) consume a large amount of time and space. Thus, there is much room for improving the algorithms. We propose three novel algorithms for discovering (weak) motifs from exact datasets, where each sequence contains at least one motif instance. TreeMotif-BF is a tree-structured algorithm whose novelty lies in constructing trees of motif instances in a breadth-first manner. Experiments demonstrate that TreeMotif-BF scales better with motif length than the other existing algorithms. However, TreeMotif-BF and many other algorithms have difficulty discovering very weak motifs because of their enormous space requirements. The algorithm TreeMotif-DF therefore constructs trees in a depth-first manner, overcoming the space limitation. Another algorithm, RecMotif, finds cliques of motif instances in recursively constructed graphs, also in a depth-first manner. RecMotif reduces the space requirement significantly. It also further improves execution time when solving open challenge problems.
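The clique-based, depth-first search mentioned above can be illustrated with a minimal sketch. It is not the thesis's RecMotif or TreeMotif-DF implementation. It assumes the common planted-motif formulation, in which a motif of length `l` occurs in every sequence with at most `d` mutations, so any two true instances differ in at most `2 * d` positions. The names `hamming`, `lmers`, `find_cliques`, `l`, and `d` are hypothetical and do not come from the abstract.

```python
from typing import Iterator, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def lmers(seq: str, l: int) -> List[str]:
    """All substrings of length l in seq, in order of position."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


def find_cliques(seqs: List[str], l: int, d: int) -> Iterator[Tuple[str, ...]]:
    """Yield one l-mer per sequence such that all pairs differ in <= 2*d positions.

    Assumption: exact dataset (every sequence holds an instance).
    The search is depth-first, so only one partial clique and its
    filtered candidate lists are held in memory at a time.
    """

    def extend(chosen: List[str], remaining: List[List[str]]):
        if not remaining:
            yield tuple(chosen)
            return
        head, tail = remaining[0], remaining[1:]
        for cand in head:
            # Rebuild the candidate lists of the later sequences, keeping
            # only l-mers that are still compatible with this choice.
            filtered = [[w for w in layer if hamming(cand, w) <= 2 * d]
                        for layer in tail]
            # Prune as soon as some later sequence has no candidate left.
            if all(filtered):
                yield from extend(chosen + [cand], filtered)

    yield from extend([], [lmers(s, l) for s in seqs])


if __name__ == "__main__":
    toy = ["TTACGTTT", "GGACGTGG", "ACGTCC"]
    for clique in find_cliques(toy, l=4, d=0):
        print(clique)
```

Each clique returned is only a candidate set of instances: pairwise distances of at most `2 * d` are necessary but not sufficient for a common motif to exist, so a separate validation step would still be needed.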
We also propose two recursive algorithms for discovering motifs from noisy datasets, where some of the input sequences may contain no motif instances. The two generalized algorithms, nTreeMotif and nRecMotif, are improved versions of TreeMotif-BF and RecMotif respectively. They are based on efficient exclusion of noisy sequences and improved construction of trees and graphs. On exact datasets, nTreeMotif and nRecMotif preserve the accuracy and efficiency of TreeMotif-BF and RecMotif respectively. Moreover, they scale better with the number of noisy sequences than the existing algorithms. The novel graph-based algorithms have met the research objective: they can effectively discover weak motifs from datasets that the existing algorithms have difficulty handling. Thus, they should be useful additions to the repertoire of tools for bioinformatics.
DRNTU::Engineering::Computer science and engineering