Machine learning techniques for program representation and comprehension with applications to mobile security
Date of Issue: 2017-10-09
School of Electrical and Electronic Engineering
Since its inception, Android has evolved into the most popular mobile operating system. Millions of Android apps provide users with a variety of functionalities, such as instant messaging, gaming and online shopping. However, this prevalence has also made Android a prime target for cybercriminals. According to Symantec’s 2016 Annual Threat Report, 13 million Android malware samples have been found in the wild so far. Such malware may steal sensitive information, control devices remotely or encrypt on-device data for ransom, putting end users at high risk. Besides malware authors, some unscrupulous developers clone the code of popular apps and repackage it with advertisements and new functionalities, thereby stealing revenue from the original developers. The sheer volume, growth rate and evolution of malware and clone apps highlight an imperative need for effective and scalable automated techniques to detect them. To perform automated detection, recent approaches from both academia and industry increasingly resort to Machine Learning (ML) techniques. Typically, the detection process involves extracting semantic features from suitable representations of programs (e.g., assembly code, call graphs) and identifying malicious or cloned code patterns. However, most of these ML algorithms can be applied only to data represented as vectors. Hence, a pivotal factor in determining the effectiveness of these detection processes is building suitable vector representations of programs. Addressing this is the primary motive of this thesis. Recognizing that higher-level abstractions of programs, such as call graphs and control- and data-flow graphs, retain program semantics well, we learn representations of these graphs (i.e., graph embeddings) and use them to perform malware and clone detection. We use a common term, Program Representation Graphs (PRGs), to refer to any of the aforementioned graphs.
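To make the vectorization requirement concrete, the following is a minimal sketch and not the method of this thesis. It turns a toy call graph into a fixed-length count vector over a small vocabulary of security-sensitive APIs. The call graph, the method names and the vocabulary are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy call graph: caller -> list of callees.
call_graph = {
    "onCreate": ["getDeviceId", "sendTextMessage"],
    "onClick": ["getDeviceId"],
    "helper": [],
}

# Hypothetical vocabulary of security-sensitive APIs.
# The order of this list fixes the position of each API in the vector.
VOCAB = ["getDeviceId", "sendTextMessage", "getLastKnownLocation"]


def bag_of_apis(graph, vocab):
    """Count how often each vocabulary API appears as a callee in the graph."""
    counts = Counter(callee for callees in graph.values() for callee in callees)
    return [counts[api] for api in vocab]


vector = bag_of_apis(call_graph, VOCAB)
print(vector)
```

A vector of this kind can be fed to a standard classifier, but it discards which method calls which. That loss of structure is what learning embeddings of the graphs themselves is meant to avoid.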
Once appropriate PRG embeddings are built, our secondary motive is to address specific issues in the malware and clone detection processes. For malware detection, we address three issues: (1) population drift induced by malware’s evolution, (2) systematic integration of malware characteristics (i.e., features) from different data sources and (3) precise localization of malicious code portions in PRGs. For clone detection, we address the detection of semantic clones (in addition to syntactic clones), which remains a crucial challenge. The following are the achievements made in this thesis:
1. We propose a novel graph kernel specifically designed to address the malware detection problem. Previous research has revealed that, besides capturing topological neighbourhoods (i.e., structural information) from PRGs, it is important to capture the context under which these neighbourhoods are reachable in order to accurately detect malicious neighbourhoods. We observe that state-of-the-art graph kernels, such as the Weisfeiler-Lehman Kernel (WLK), capture the structural information well but fail to capture contextual information. To address this, we develop the Contextual WLK (CWLK), which is capable of capturing both types of information. To the best of our knowledge, this is the first graph kernel specifically addressing a problem from the field of program analysis.
2. It is well known that malware constantly evolves to evade detection, which makes the entire malware population non-stationary. Despite this, most prior works on ML-based malware detection have assumed that the distribution of the observed malware features does not change or evolve over time. We address the problem of malware population drift and propose a novel online learning based framework to detect malware, named Casandra (Context-aware, Adaptive and Scalable ANDRoid mAlware detector). To perform accurate and scalable detection, Casandra uses CWLK to capture and represent security-sensitive behaviors in PRGs. To adapt automatically to population drift, it uses an online classifier. When evaluated with more than 87,000 apps, Casandra achieves 89.92% accuracy, outperforming existing techniques by more than 25% in their typical batch learning setting and by more than 7% when they are continuously retrained.
3. Existing malware detection approaches have typically used classifiers with a variety of features such as security-sensitive APIs, instruction sequences and information flows. These feature sets provide complementary perspectives (referred to interchangeably as views) of apps’ behaviors, each with inherent strengths and limitations. That is, some views are more amenable to detecting certain attacks, while they may not characterize several other attacks well. Existing approaches (including Casandra) use either one or a selected few of the aforementioned feature sets, which prevents them from detecting a substantial majority of attacks. To address this, we propose MKLDroid, a unified framework that systematically integrates multiple views in the hope that, while a malware app can disguise itself in some views, disguising itself in every view while maintaining malicious intent will prove substantially more difficult. MKLDroid leverages CWLK to extract PRG embeddings from different views and then employs Multiple Kernel Learning (MKL) to find the weighted combination of views that yields the best detection rates. Besides integrating multiple views, its salient trait is its ability to localize and rank malicious code portions (e.g., classes/methods) in PRGs based on their degree of maliciousness.
4. We observe that existing graph kernels such as WLK exhibit an inherent inability to model subgraph-level similarities, which results in poor generalization when they are used with large graphs (such as PRGs). To address this, we propose Subgraph2vec, a novel approach for learning latent representations of rooted subgraphs, inspired by recent advancements in Deep Learning. We demonstrate that these subgraph vectors can be used to build a deep learning variant of WLK. This deep graph kernel shows potential for detecting semantic malware and clone variants. Our experiments on several large-scale datasets reveal that Subgraph2vec achieves significant improvements in generalization, and thereby in accuracy, over existing kernels on both malware and clone detection tasks.
In sum, this thesis proposes two methods for learning representations of PRGs, namely CWLK and Subgraph2vec. With the PRG embeddings thus built, we address four specific issues that plague Android malware and clone detection approaches, namely population drift, integration of multi-view features for comprehensive detection, localization of malicious code portions and semantics-based detection, by leveraging ML techniques such as Online Learning, Kernel Methods, MKL and Deep Learning.
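The plain Weisfeiler-Lehman subtree kernel that contributions 1 and 4 build on can be illustrated with a short sketch. This is the standard WLK idea only: it does not implement CWLK's contextual labels or Subgraph2vec, whose details are not given in this abstract. The graphs, the node labels and the function names `wl_features` and `wl_kernel` are hypothetical, and labels are kept as uncompressed strings for readability, whereas practical implementations usually hash or compress them.

```python
from collections import Counter


def wl_features(adj, labels, iterations=1):
    """Count the node labels observed over rounds of WL relabeling.

    adj:    dict mapping each node to the list of its neighbours
            (for a call graph, the callees of that node)
    labels: dict mapping each node to its initial label string
    """
    features = Counter(labels.values())
    current = dict(labels)
    for _ in range(iterations):
        relabeled = {}
        for node in adj:
            # New label = own label plus the sorted labels of the neighbours.
            neighbour_labels = sorted(current[n] for n in adj[node])
            relabeled[node] = current[node] + "(" + ",".join(neighbour_labels) + ")"
        current = relabeled
        features.update(current.values())
    return features


def wl_kernel(f1, f2):
    """Dot product of two label-count feature maps."""
    return sum(count * f2[label] for label, count in f1.items())


# Two hypothetical toy graphs with generic labels "A" and "B".
g1_adj, g1_labels = {0: [1, 2], 1: [], 2: []}, {0: "A", 1: "B", 2: "B"}
g2_adj, g2_labels = {0: [1], 1: []}, {0: "A", 1: "B"}

f1 = wl_features(g1_adj, g1_labels)
f2 = wl_features(g2_adj, g2_labels)
print(wl_kernel(f1, f2))
```

Each relabeling round summarizes a node's neighbourhood one hop further out, so the label counts act as an explicit vector representation of the graph. Because two graphs only score on exactly matching labels, structurally similar but not identical subgraphs contribute nothing, which is the generalization limitation that contribution 4 describes.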