Query processing in publish/subscribe systems for textual data streams
Date of Issue: 2016
School of Computer Engineering
Centre for Advanced Information Systems
With the rapid development of online social media (e.g., Facebook and Flickr) and micro-blogging services (e.g., Twitter, Tumblr, and Weibo), huge amounts of streaming text data are being generated at an unprecedented scale. Such data is particularly well-suited for information dissemination. The demand for disseminating interesting information from data streams to users gives prominence to content-based publish/subscribe systems, in which users personalize their requirements by issuing subscription queries and are notified when items matching those requirements are captured from the data stream. Although content-based publish/subscribe systems have been applied successfully in many real-world applications, existing work on them has the following limitations. First, existing content-based publish/subscribe systems usually do not consider location. With the deployment and use of GPS-enabled devices, spatial (or geographical) documents are emerging, in which content is associated with locations (e.g., Points of Interest on Google Maps, check-ins on Foursquare, and geo-tagged tweets on Twitter). As a result, users may want to issue subscription queries with both keyword and location requirements. For instance, a user who subscribes to promotional information about seafood restaurants may be interested only in information posted by nearby seafood restaurants. Second, existing publish/subscribe systems do not consider query result diversification, which has drawn considerable attention as a way to increase user satisfaction in web search. To overcome the first limitation, we conduct the first study on location-aware publish/subscribe for textual data streams.
Specifically, we propose a new type of subscription query for publish/subscribe systems, the Boolean Range Continuous (BRC) query, which continuously finds, over a data stream, spatio-temporal documents whose locations fall in the query region and whose textual information satisfies the query's Boolean predicates. We develop an efficient system for processing BRC queries. To improve the quality of results returned by each subscription query, we propose a new type of location-based subscription query, the Temporal Spatial-Keyword Top-k Subscription (TaSK) query, which ranks spatio-temporal documents and continuously maintains the top-ranked documents based on a score that considers three aspects: (1) text relevance; (2) spatial proximity; and (3) document recency. We develop an efficient approach to maintaining up-to-date top-k results for a large number of TaSK queries over a stream of spatio-temporal documents. To address the second limitation, we develop the first diversity-aware publish/subscribe system over a text stream. Specifically, we propose the Diversity-Aware Top-k Subscription (DAS) query, which takes into account text relevance, document recency, and result diversity when matching a new document. We propose an efficient mechanism to continuously maintain, for each DAS query, an up-to-date result set containing the k most recently returned documents over a text stream.
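As a rough illustration of the matching condition behind a BRC query, the following minimal sketch checks a single incoming document against a set of subscriptions. It is not the system developed in the thesis: the rectangular query region, the conjunctive-normal-form encoding of the Boolean predicates, the brute-force scan over subscriptions, and all names (`Document`, `BRCQuery`, `publish`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A spatio-temporal document (hypothetical representation)."""
    terms: frozenset      # keywords extracted from the text
    lat: float
    lon: float
    timestamp: float


@dataclass(frozen=True)
class BRCQuery:
    """A Boolean range subscription (hypothetical representation).

    region:  assumed rectangle (min_lat, min_lon, max_lat, max_lon)
    clauses: assumed CNF form; each clause is a set of alternative
             keywords, and every clause must be satisfied
    """
    region: tuple
    clauses: tuple

    def matches(self, doc: Document) -> bool:
        min_lat, min_lon, max_lat, max_lon = self.region
        in_region = (min_lat <= doc.lat <= max_lat
                     and min_lon <= doc.lon <= max_lon)
        # A clause is satisfied when it shares at least one term with the document.
        return in_region and all(clause & doc.terms for clause in self.clauses)


def publish(doc: Document, subscriptions: dict) -> list:
    """Return the ids of all subscriptions matched by a new document.

    This is a naive linear scan over every subscription; an efficient
    system would index subscriptions instead of testing each one.
    """
    return [sid for sid, query in subscriptions.items() if query.matches(doc)]


subscriptions = {
    "seafood_nearby": BRCQuery(
        region=(1.0, 103.0, 2.0, 104.0),
        clauses=(frozenset({"seafood"}),
                 frozenset({"promotion", "discount"})),
    )
}
```

Under these assumptions, a document located inside the rectangle and containing "seafood" together with either "promotion" or "discount" would be delivered to the `seafood_nearby` subscriber, mirroring the nearby seafood restaurant example in the abstract.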
DRNTU::Engineering::Computer science and engineering::Information systems::Database management