Core-sets for Fair and Diverse Data Summarization

Authors: Sepideh Mahabadi, Stojan Trajanovski

NeurIPS 2023

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Finally, we run several experiments showing the effectiveness of our core-set approach. In particular, we apply constrained diversity maximization to summarize a set of timed messages in a way that takes into account the messages' recency. Specifically, the summary should include more recent messages compared to older ones. This is a real task in one of the largest communication platforms, affecting the experience of hundreds of millions of daily active users. By utilizing our core-set method for this task, we achieve a 100x speed-up while losing only a few percent of diversity. Moreover, our approach allows us to improve the space usage of the algorithm in the streaming setting.
Researcher Affiliation | Industry | Sepideh Mahabadi, Microsoft Research, Redmond, WA, USA (smahabadi@microsoft.com); Stojan Trajanovski, Microsoft, London, United Kingdom (sttrajan@microsoft.com)
Pseudocode | Yes | Algorithm 1: The GMM Algorithm. Input: a point set P, and k. Output: subset S = {p1, ..., pk} ⊆ P, and radii r1, ..., rk. 1: S ← an arbitrary point from P, and r1 ← ∞ 2: for i = 2 to k do 3: pi ← the farthest point in P from S, i.e., argmax_{p ∈ P} dist(p, S) 4: S ← S ∪ {pi} 5: ri ← minimum pairwise distance in S, i.e., min_{p1, p2 ∈ S} dist(p1, p2) 6: end for 7: return S
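The GMM algorithm above is the classic greedy farthest-point traversal (Gonzalez's heuristic). A minimal Python sketch is shown below; the function name `gmm` and the use of a caller-supplied `dist` function are illustrative choices, not from the paper. It exploits the standard property that the distance of each newly added point to the current set equals the new minimum pairwise distance.

```python
import random

def gmm(points, k, dist):
    """Greedy farthest-point traversal (GMM / Gonzalez).

    Returns k centers and radii, where radii[i] is the minimum
    pairwise distance among the first i+1 chosen centers
    (radii[0] is infinity, as a single point has no pair).
    """
    S = [random.choice(points)]          # step 1: arbitrary start point
    # d[j] holds dist(points[j], S), the distance to the current center set
    d = [dist(p, S[0]) for p in points]
    radii = [float("inf")]
    for _ in range(1, k):
        # step 3: farthest point from the current set S
        far_idx = max(range(len(points)), key=lambda j: d[j])
        p = points[far_idx]
        # the insertion distance of p equals the new minimum pairwise
        # distance in S, a standard property of the greedy traversal
        radii.append(d[far_idx])
        S.append(p)                      # step 4: S <- S u {p}
        # maintain d as the distance to the enlarged set
        d = [min(d[j], dist(points[j], p)) for j in range(len(points))]
    return S, radii
```

Each iteration costs O(|P|) distance evaluations, so the full run is O(k·|P|), which is what makes the method practical as a core-set construction.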
Open Source Code | Yes | We have also made the code of the algorithms publicly available at https://github.com/microsoft/coresets-fair-diverse.
Open Datasets | Yes | We use a Reddit dataset [42] of text messages that are semantically embedded using BERT [12] into a metric space. We also utilize message creation timestamps, as we aim to show diverse yet timely and relevant messages to a user, created across different time-window intervals within a certain period; we thus assign a color to each message based on which time-window interval (e.g., a week, thus having 4 colors in total) the message creation time belongs to. [...] We also utilize the MovieLens dataset [18], where the movie titles are semantically embedded into a metric space.
Dataset Splits | No | The paper does not explicitly provide details about train/validation/test dataset splits or cross-validation. It describes partitioning data into 'groups/colors' for the fair diversity maximization task, not typical ML training splits.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models, or memory specifications.
Software Dependencies | No | The paper mentions BERT [12], MPNETv2 [35], T5 [29], RoBERTa [23], MiniLM [38], and GloVe [32] as semantic embeddings and models, and the Reddit API for data extraction. However, it does not specify version numbers for any of these software components or libraries.
Experiment Setup | Yes | We use a Reddit dataset [42] of text messages that are semantically embedded using BERT [12] into a metric space. [...] We model this task by partitioning the messages into groups/colors based on their creation times and assign a desired budget to each group. [...] color_i = m (t_i - t_min) / (t_max - t_min), where t_min and t_max are the minimum and maximum creation timestamps within the considered full input data and m is the desired total number of colors. In this way, we have m equidistant time intervals such that messages (relatively) close to one another by creation time belong to the same color and the most recent messages are in the m-th color. In this way, we can also run Fair Diversity Maximization (FDM) for a desired distribution of the number of messages within a color, i.e., k_i for i = 1, 2, ..., m, either by fairness (all k_i's equal) or by recency (k_i increases with i).
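The recency-based coloring rule quoted above can be sketched as follows. This is a hypothetical reading of the formula, assuming the value is taken as a 1-based bucket index (floor-plus-one, with t_max clamped into the last bucket); the paper's quote does not spell out the rounding convention.

```python
def assign_color(t, t_min, t_max, m):
    """Map a creation timestamp t to one of m equidistant
    time-window colors in {1, ..., m}.

    The most recent messages (t near t_max) land in color m,
    matching the paper's recency-ordered coloring.
    """
    if t >= t_max:
        # clamp the newest message(s) into the last interval
        return m
    # quoted rule: color_i = m * (t_i - t_min) / (t_max - t_min),
    # interpreted here as floor(...) + 1 to get a 1-based index
    return int(m * (t - t_min) / (t_max - t_min)) + 1
```

With m = 4 over a 4-week window, each color covers one week, and a recency-weighted FDM run would then assign larger budgets k_i to higher color indices.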