Detection of faults in database applications using static program analysis and data mining
Date of Issue: 2016-04-20
School of Electrical and Electronic Engineering
This thesis presents approaches for detecting faults in database applications, such as violations of database constraints and anomalies in the usage of database attributes. It also introduces a testing method that generates test cases to detect such faults. Database Management Systems are designed to meet the complex requirements that arise in providing data consistency and persistence. Examples include concurrent data access, the execution of complex transactions, and data analysis over large datasets. Most modern business systems depend on such general-purpose Database Management Systems. Hence it is critical to guarantee the correctness of database operations in practical database applications. Current approaches to detecting faults in database applications suffer from major drawbacks that allow faults to occur frequently. Most of these approaches do not take into consideration the interaction between program source code and Structured Query Language (SQL) queries. This interaction is the key component of database applications because it constitutes most of the business logic. Furthermore, current methods for impact analysis of database applications either consider only individual database queries or analyze only program flow dependencies. However, the data flow between programs and the Database Management System also plays an important role in impact analysis of database applications. In addition, classification techniques have long been used for anomaly detection, but the quantity of training data required for effective classification is typically large. Manual creation of this training data is time-consuming and tedious. Moreover, since the requirement specification keeps evolving, behaviors that are currently normal might become abnormal in the future.
For these reasons, classification-based fault detection algorithms, which rely on labeled data, are often inaccurate and highly expensive. Last but not least, rather than using a single declarative language, database applications are usually written in a mix of imperative and declarative languages. Therefore, existing software testing approaches designed for imperative languages cannot be applied directly to database applications. Hence, alternative solutions that are easy to use yet effective are required to analyze database applications comprehensively and find faults. Based on these motivations, this thesis proposes four novel approaches based on established static program analysis and data mining techniques. In relational databases, key and referential constraints are essential for ensuring the accuracy and consistency of data in a Database Management System. Most Database Management Systems automatically enforce key and referential constraints and reject any operation that would violate them. However, handling the exceptions raised by such rejections still requires extra coding effort from programmers. Current research mainly focuses on maintaining the enforcement of database constraints. No research has explored automatic exception handling for violations of database constraints. We propose an approach that automatically generates exception handling code for SQL queries and inserts it into the source code where needed. This improves programming efficiency and helps avoid coding errors in exception handling, as well as neglected or inconsistent handling of the same category of exceptions. As database applications become more and more complicated, this rising complexity calls for more frequent updates to these applications.
However, little software maintenance research has specifically targeted database applications. Existing approaches do not provide comprehensive information on the dependencies involved in SQL queries, so maintainers still have to manually inspect large chunks of code to analyze the impact of every change. To complement existing approaches, a novel graph structure called the attribute dependency graph is proposed to reveal the intricate relationships between database attributes and the source code that uses them. This approach relies on conventional inter-procedural static program analysis to extract the attribute dependency graph. We also propose a clustering-based anomaly detection approach to detect anomalies in the usage of database attributes. We abstract and characterize the database operations performed on a database attribute as a feature vector extracted from code through static program analysis. We propose a method that separates database attributes into clusters by applying a distance-based metric. Once the clusters of database attributes have formed, small clusters are identified and labeled as anomalous, on the assumption that an abnormal attribute is likely to be an outlier in the dataset because of its rarity. The resulting cluster model can then be used to detect anomalies in unseen database applications. Our anomaly detection model provides an alternative to existing fault detection approaches for database applications. To address the problem that traditional testing methods may not be applicable to database applications, we propose a testing approach that targets the coverage of attribute lifecycles. We also propose a test coverage analysis to measure the quality of a test suite. This thesis also presents an experimental evaluation of each proposed approach and demonstrates that the approaches are useful and effective.
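As an illustration of the clustering-based anomaly detection idea summarized above, the following minimal Python sketch groups attributes by the distance between their feature vectors and flags the members of small clusters. It is not the thesis's algorithm. The feature layout (counts of SELECT, INSERT, UPDATE and DELETE operations per attribute), the simple leader-style clustering, the attribute names, and the `radius` and `max_fraction` parameters are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_cluster(vec, clusters, radius):
    """Return the closest cluster whose centroid lies within `radius`, or None."""
    best, best_dist = None, radius
    for cluster in clusters:
        dist = euclidean(vec, cluster["centroid"])
        if dist <= best_dist:
            best, best_dist = cluster, dist
    return best


def cluster_attributes(features, radius):
    """Leader-style clustering (an assumed stand-in for the thesis's method):
    each attribute joins the nearest cluster within `radius`, else starts a new one."""
    clusters = []
    for name, vec in features.items():
        cluster = nearest_cluster(vec, clusters, radius)
        if cluster is None:
            clusters.append({"centroid": list(vec), "members": [name]})
        else:
            n = len(cluster["members"])
            # Update the centroid as the running mean of its members.
            cluster["centroid"] = [
                (c * n + x) / (n + 1) for c, x in zip(cluster["centroid"], vec)
            ]
            cluster["members"].append(name)
    return clusters


def is_small(cluster, clusters, max_fraction):
    """A cluster is 'small' if it holds at most `max_fraction` of all attributes."""
    total = sum(len(c["members"]) for c in clusters)
    return len(cluster["members"]) / total <= max_fraction


def anomalous_attributes(clusters, max_fraction):
    """Names of all attributes that fall into small clusters."""
    return sorted(
        name
        for c in clusters
        if is_small(c, clusters, max_fraction)
        for name in c["members"]
    )


def is_anomalous(vec, clusters, radius, max_fraction):
    """Score an unseen attribute: anomalous if it fits no cluster or only a small one."""
    cluster = nearest_cluster(vec, clusters, radius)
    return cluster is None or is_small(cluster, clusters, max_fraction)


# Hypothetical feature vectors: (SELECT, INSERT, UPDATE, DELETE) counts per attribute.
features = {
    "orders.id": (5, 1, 0, 0),
    "orders.total": (4, 1, 1, 0),
    "users.id": (5, 1, 0, 0),
    "users.email": (4, 1, 1, 0),
    "users.name": (5, 1, 1, 0),
    "audit.flag": (0, 0, 9, 6),
}

clusters = cluster_attributes(features, radius=2.0)
print(anomalous_attributes(clusters, max_fraction=0.2))  # prints ['audit.flag']
```

In this toy data set, five attributes share a read-mostly usage pattern and form one large cluster, while `audit.flag` sits alone and is therefore flagged. The same cluster model can score an attribute from another application through `is_anomalous`. In practice, the feature vectors would come from static analysis of the application code rather than being written by hand.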