Description

Skyline queries have been a topic of intense study in the database area for over a decade now. Similarly, the top-k retrieval query has been heavily investigated by both the database as well as the web research communities. This tutorial will delve into the background of these two areas, and specifically focus on the recent challenges with respect to returning a small set of results to users, as well as requiring minimal intervention or input from them. These are two main concerns with skylines and top-k respectively, and therefore have drawn a great deal of attention in the recent years with several interesting ideas being proposed in the research community.

This tutorial will cover the current approaches to representative database selection. We will focus on both the theoretical models as well as the practical aspects from an industry standpoint. The topics of covered in this tutorial will include identifying representative subsets of the skyline set, interaction based approaches, e-commerce product search, and leveraging aggregate user preference statistics.

For more information about the tutorial, please contact Atish Das Sarma at adassarma (at) ebay.com.

Presenters

Atish Das Sarma is currently a Staff Research Scientist at eBay Research Labs. Prior to his current position, Atish was a Research Scientist at Google Research. Atish received his Ph.D. degree from Georgia Tech. and undergraduate from IIT-Bombay, all in computer science. Atish has published over 40 research papers and led around 10 patents. Atish received the Best Paper Award at PODS-2008, a Facebook award in 2009 for work done that was named a top-50 fbFund Finalist for most promising upcoming start-up ideas, and a Google award for excellent paper in structured data". His works have been featured in New Scientist, and MIT Technology Review, and invited to interview at NPR, and several of his papers have been invited to top journals. Atish serves on numerous international program committees (such as KDD, WWW, WSDM, VLDB, SIGMOD, CIKM, ICWSM, and others) and has given several invited talks. His research publications include those in algorithms (STOC, SICOMP, JACM, TCS), Databases (SIGMOD, VLDB, PODS, ICDE), Data Mining (WSDM, WWW, KDD), and Distributed Computing (PODC, SPAA, DISC, INFOCOM).

Ashwin Lall has been at the Department of Mathematics and Computer Science at Denison University since 2010. He was recently named a Bayley-Bowen Fellow (2013-16). He received his Ph.D. from the University of Rochester, where he was a Sproull Fellow, and was a postdoc at Georgia Tech. His research interests span different areas, including databases (with papers at VLDB, ICDE, SIGMOD), measurement (SIGMETRICS, INFOCOM, IMC), and machine learning (NIPS, ACL). More specically, he works in the areas of representative databases, streaming algorithms, networking, natural language processing, and distributed computing. He co-chaired the Big Data Analytics workshop at SIGMETRICS 2013.

Nish Parikh joined eBay Research Labs in February 2008 and is a Senior Sta Research Scientist. At eBay Research Labs, he works on query analysis, recommender systems and large-scale data processing. Prior to joining eBay Research Labs he was part of the team that launched eBay's Next Generation Search Engine Voyager which supports near realtime indexing of products and serves more than 250M queries a day. Prior to joining eBay, Nish received an M.S. in Computer Science from University of Southern California and a B.S. in Electrical Engineering from Gujarat University where he was awarded a gold medal for academic excellence. Nish has published in premier conferences such as SIGIR, KDD, WWW, CIKM, WSDM. In addition to the research community engagement, Nish is a frequent speaker in industry and big data forums such as the Hadoop Summit.

Neel Sundaresan has been with eBay since 2005 and has been managed the research team for the past 7 years during which tenure he created research departments in big data science, economics, computer vision, human language technologies, human computer interaction, among others. He drives a platform based approach to research and has led research in diverse areas from search and recommender systems, trust and reputation systems, social networks, and large scale data analysis systems. He has over 60 research publications, over 80 issued patents and several patents pending. He is a frequent invited speaker at several national and international technical conferences.

Topics

Identifying representative subset of the skyline set: There are several interesting papers within this approach. We will delve into the recent works of [4-6], and mention their approaches. Lin et al. [4] propose finding a set of k skyline tuples that dominate the most number of tuples. Tao et al. [5] illustrates that the approach of [4] could lead to an undesirable solution. This is essentially because of the fact that it is unstable. They propose an alternative distance-based solution which is to solve the k-center problem on the skyline tuples. However, our primary focus will be our work [6] on regret-minimizing representative databases which followed the work of [4, 5]. The tutorial will discuss the relative merits of each of these techniques and show comparative results.

In our paper [6], we consider the setting where users have unknown utility functions. Given a list of k tuples, we say that a user is x% happy with the list if the utility she obtains from the best tuple in this list is at least x% of the utility she obtains from the best tuple in the whole database. We attempt for an operator that outputs a small set of k tuples without asking for users' utility functions and yet makes every user at least x% happy. We show that such an algorithm indeed exists and even more surprisingly, the happiness guarantee is independent of the size of the database.

Interaction based approaches: Our work in this technique builds on [6] and presents an interactive regret minimization framework [7]. In this paper, we adopt the notion of maximum regret ratio when users have linear utility functions proposed in [6] since it makes no additional assumptions about the users. Here we study how interactions help further improve the guarantees on user happiness. The goal is to rene the returned result set once the user responds with the best tuple from the results presented in the previous iteration.

There are a couple of other approaches also that offer different modes of interaction to improve the quality of the returned set of results, and we'll mention some of them in the tutorial. For example, Mindolin et al. [8] propose the p-skyline query which is based on the premise that different attributes have varying levels of importance for users (for e.g. a buyer may care a lot about the reputation/trust of the seller while another may associate maximal importance with the price of the item for sale). They suggest a feedback mechanism where users can partition the set of returned results to identify desired tuples, and this is then fed back into the algorithm for improvement.

e-commerce product search: One of the primary works we will focus on here from the industrial application standpoint is our paper that goes beyond relevance in marketplace search [9]. In this paper we study diversity and its relations to search relevance in the context of an online marketplace. We conduct a large-scale log-based study using click-stream data from a leading eCommerce site. We introduce 3 main metrics: selection (diversity), trust, and value. In our analysis we also show how these interact with relevance in different ways. We study the benets of diversity and also show why guaranteeing diversity is important. On this topic of eBay or e-commerce specic product search, we also performed work on leveraging user interaction with a quick learning approach that was demonstrated in [10]. We will present this user-tunable approach to marketplace search as well. Learning from user behavior during interactions, we show how performance in product search can be enhanced quickly at run time.

In [11], the authors describe a framework that systematically rewards novelty and diversity. The framework makes a precise distinction between novelty (the need to avoid redundancy), and diversity (the need to resolve ambiguity).

In [12], the problem of efficiently computing diverse query results in online shopping applications is studied. It shows efficient query processing techniques that guarantee diversity. Although some attributes of a narrow section of inventory (automobiles) that impact diversity are discussed in [12], we will go into greater insights combining our work to delve into what dimensions impact user satisfaction and diversity in ecommerce search. Specically, we will elaborate on how relevance and diversity interplay in e-commerce search. As monetary transactions are involved in online marketplaces, there are various factors like reputation of sellers and value proposition of products that play a big role. Our work uses a large-scale retrospective dataset from eBay which has, at any point in time, millions of items for sale in different categories by different sellers.

Key References

[1] I. F. Ilyas, G. Beskales, and M. A. Soliman, "A survey of top- query processing techniques in relational database systems," ACM Comput. Surv., vol. 40, no. 4, 2008.

[2] P. Godfrey, R. Shipley, and J. Gryz, "Algorithms and analyses for maximal vector computation," VLDB J., vol. 16, no. 1, pp. 5-28, 2007, also appeared in VLDB'05.

[3] S. Borzonyi, D. Kossmann, and K. Stocker, "The skyline operator," in ICDE, 2001, pp. 421-430.

[4] X. Lin, Y. Yuan, Q. Zhang, and Y. Zhang, "Selecting stars: The k most representative skyline operator," in ICDE, 2007, pp. 86-95.

[5] Y. Tao, L. Ding, X. Lin, and J. Pei, "Distance-based representative skyline," in ICDE, 2009.

[6] D. Nanongkai, A. Das Sarma, A. Lall, R. J. Lipton, and J. J. Xu, "Regret-minimizing representative databases," PVLDB, vol. 3, no. 1, pp. 1114-1124, 2010.

[7] D. Nanongkai, A. Lall, A. Das Sarma, and K. Makino, "Interactive regret minimization," in SIGMOD Conference, 2012, pp. 109-120.

[8] D. Mindolin and J. Chomicki, "Discovering relative importance of skyline attributes," PVLDB, vol. 2, no. 1, pp. 610-621, 2009.

[9] N. Parikh and N. Sundaresan, "Beyond relevance in marketplace search," in CIKM, 2011, pp. 2109-2112.

[10] N. Parikh, N. Sundaresan, "A user-tunable approach to marketplace search," in WWW (Companion Volume), 2011, pp. 245-248.

[11] C. L. A. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Buttcher, and I. MacKinnon, "Novelty and diversity in information retrieval evaluation," in SIGIR, 2008, pp. 659-666.

[12] E. Vee, U. Srivastava, J. Shanmugasundaram, P. Bhat, and S. Amer-Yahia, "Efficient computation of diverse query results," in ICDE, 2008, pp. 228-236.

[13] A. Das Sarma, A. Lall, D. Nanongkai, R. J. Lipton, and J. J. Xu, "Representative skylines using threshold-based preference distributions," in ICDE, 2011, pp. 387-398.

[14] A. Das Sarma, A. Lall, D. Nanongkai, and J. J. Xu, "Randomized multi-pass streaming skyline algorithms," PVLDB, vol. 2, no. 1, pp. 85-96, 2009.

[15] A. N. Papadopoulos, A. Lyritsis, A. Nanopoulos, and Y. Manolopoulos, "Domination mining and querying," in DaWaK, 2007, pp. 145-156.

[16] M. L. Yiu and N. Mamoulis, "Multi-dimensional top dominating queries," VLDB J., vol. 18, no. 3, pp.695-718, 2009, also VLDB'07.

[17] W. Zhang, X. Lin, Y. Zhang, J. Pei, and W. Wang, "Threshold-based Probabilistic Top-k Dominating Queries."

[18] X. Lian and L. C. 0002, "Top-k dominating queries in uncertain databases," in EDBT, 2009, pp. 660-671.

[19] M. Kontaki, A. N. Papadopoulos, and Y. Manolopoulos, "Continuous top-k dominating queries in subspaces," in Panhellenic Conference on Informatics, 2008, pp. 31-35.

[20] M. Goncalves and M.-E. Vidal, "Top-k skyline: A unied approach," in OTM Workshops, 2005.

[21] T. Xia, D. Zhang, and Y. Tao, "On skylining with flexible dominance relation," in ICDE, 2008, pp. 1397-1399.

[22] J. Lee, G. won You, and S. won Hwang, "Personalized top-k skyline queries in high-dimensional space," Inf. Syst., vol. 34, no. 1, pp. 45-61, 2009.

[23] C. Y. Chan, H. V. Jagadish, K.-L. Tan, A. K. H. Tung, and Z. Zhang, "On high dimensional skylines," in EDBT, 2006, pp. 478-495.

[24] C.-Y. Chan, H.V. Jagadish, K.-L. Tan, A.K.H. Tung, and Z. Zhang, "Finding k-dominant skylines in high dimensional space," in SIGMOD, 2006.

[25] B. Jiang, J. Pei, X. Lin, D. W. Cheung, and J. Han, "Mining preferences from superior and inferior examples," in KDD, 2008, pp. 390-398.

[26] Z. Xu, R. Akella, and Y. Zhang, "Incorporating diversity and density in active learning for relevance feedback," in ECIR, 2007, pp. 246-257.

[27] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong, "Diversifying search results," in WSDM, 2009, pp. 5-14.

[28] S. Gollapudi and A. Sharma, "An axiomatic approach for result diversication," in WWW, 2009, pp. 381-390.

[29] F. Radlinski, P. N. Bennett, B. Carterette, and T. Joachims, "Redundancy, diversity and interdependent document relevance," SIGIR Forum, vol. 43, no. 2, pp. 46-52, 2009.

[30] L. McGinty and B. Smyth, "On the role of diversity in conversational recommender systems," in ICCBR, 2003, pp. 276-290.

[31] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen, "Improving recommendation lists through topic diversication," in WWW, 2005, pp. 22-32.

[32] R. Price and P. R. Messinger, "Optimal recommendation sets: Covering uncertainty over user preferences," in AAAI, 2005, pp. 541-548.