Spam analysis and detection on microblog
Date of Issue: 2017-02-28
School of Computer Science and Engineering
Micro-blogging platforms such as Twitter and Weibo are popular channels for collecting and disseminating information. Because posting content is easy and the social graph is huge, these platforms have attracted not only legitimate users but also spammers. The increasing activity of spammers makes spam a serious problem that also degrades the user experience. Because content propagates quickly in micro-blogging services, it is highly desirable to detect spam in real time to minimize its impact. Hence, a real-time spam detection technique that leverages fine-grained information is essential. As Twitter is one of the most popular micro-blogging platforms, in this dissertation we use a Twitter dataset to study spam on microblogs. Spam detection on Twitter is an active area of research, and most studies focus on identifying spammers: spam is tackled by blocking the accounts that the system detects as spammers. This account-level approach has drawbacks. A user account that mistakenly grants permission to a malicious third-party application may get blocked because of the posts the application makes on the user's behalf. Similarly, compromised accounts may get blocked because of tweets posted by hijackers. Further, spammers are generally detected only after they have posted many spam tweets. To address these issues, tweet-level spam detection is necessary. Tweet-level spam detection is inherently challenging because a tweet is a short and noisy text. Spam tweets heavily exploit hashtags to promote themselves to a wider audience. Hence, we focus on hashtag-oriented spam detection, collecting tweets by using trending hashtags as queries. Further, we propose an effective way of labeling tweets to generate a dataset for this task. To the best of our knowledge, no benchmark dataset existed for it; hence we present HSpam14, a public dataset of 14 million tweets labeled with the proposed labeling technique.
We carried out a detailed tweet-level analysis based on hashtags and tweet content, and a user-level analysis based on user profiles. The resulting understanding of spam tweets and legitimate tweets, which are also known as ham tweets, is used to design a spam tweet detection system. Unlabeled tweets are easy to obtain, and they can be used to improve system performance and to cope with evolving spamming activities. Hence, we propose a semi-supervised real-time spam detection system to identify spam tweets effectively. Spam in a microblog causes problems for several functionalities, such as search, recommendation, and text analysis. As a case study, we analyze the effect of spam tweets on hashtag recommendation using the HSpam14 dataset. We observe that features and methods that are effective on a collection of spam tweets may not be effective on legitimate tweets. Our study shows that experiments conducted on a spammy dataset give misleading results. Hence, it is crucial to perform spam filtering before conducting any analysis on Twitter. In a nutshell, this dissertation elucidates the effectiveness of tweet-level spam detection and different aspects of spam and ham tweets, and paves the way for further research on fine-grained spam detection on microblogs.
DRNTU::Engineering::Computer science and engineering::Computer applications::Social and behavioral sciences