Core-sets for Fair and Diverse Data Summarization
Authors: Sepideh Mahabadi, Stojan Trajanovski
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we run several experiments showing the effectiveness of our core-set approach. In particular, we apply constrained diversity maximization to summarize a set of timed messages in a way that takes the messages' recency into account. Specifically, the summary should include more recent messages compared to older ones. This is a real task in one of the largest communication platforms, affecting the experience of hundreds of millions of daily active users. By utilizing our core-set method for this task, we achieve a 100x speed-up while losing only a few percent of diversity. Moreover, our approach allows us to improve the space usage of the algorithm in the streaming setting. |
| Researcher Affiliation | Industry | Sepideh Mahabadi, Microsoft Research, Redmond, WA, USA, smahabadi@microsoft.com; Stojan Trajanovski, Microsoft, London, United Kingdom, sttrajan@microsoft.com |
| Pseudocode | Yes | Algorithm 1: The GMM Algorithm. Input: a point set P and k. Output: subset S = {p1, ..., pk} ⊆ P, and radii r1, ..., rk. 1: S ← {an arbitrary point from P}, and r1 ← ∞ 2: for i = 2 to k do 3: pi ← the farthest point in P from S, i.e., argmax_{p ∈ P} dist(p, S) 4: S ← S ∪ {pi} 5: ri ← minimum pairwise distance in S, i.e., min_{p1, p2 ∈ S} dist(p1, p2) 6: end for 7: return S (a runnable sketch of this procedure is given below the table) |
| Open Source Code | Yes | We have also made the code of the algorithms publicly available: https://github.com/microsoft/coresets-fair-diverse |
| Open Datasets | Yes | We use a Reddit dataset [42] of text messages that are semantically embedded using BERT [12] into a metric space. We also utilize message creation timestamps, as we aim to show diverse yet timely and relevant messages to a user, created across different time-window intervals within a certain period; we thus assign a color to each message based on which time-window interval (e.g., a week, giving 4 colors in total) the message creation time belongs to. [...] We also utilize the MovieLens dataset [18], where the movie titles are semantically embedded into a metric space. |
| Dataset Splits | No | The paper does not explicitly provide details about train/validation/test dataset splits or cross-validation. It describes partitioning data into 'groups/colors' for the fair diversity maximization task, not typical ML training splits. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions BERT [12], MPNETv2 [35], T5 [29], RoBERTa [23], MiniLM [38], and GloVe [32] as semantic embeddings and models, and the Reddit API for data extraction. However, it does not specify version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | We use a Reddit dataset [42] of text messages that are semantically embedded using BERT [12] into a metric space. [...] We model this task by partitioning the messages into groups/colors based on their creation times and assign a desired budget to each group. [...] color_i = ⌈m · (t_i − t_min) / (t_max − t_min)⌉, where t_min and t_max are the minimum and maximum creation timestamps within the considered full input data and m is the desired total number of colors. In this way, we have m equidistant time intervals such that messages (relatively) close to one another in creation time belong to the same color, and the most recent messages are in the m-th color. In this way, we can also run Fair Diversity Maximization (FDM) for a desired distribution of the number of messages per color, i.e., k_i for i = 1, 2, ..., m, either by fairness (all k_i equal) or by recency (k_i increases with i). A sketch of this bucketing rule is also given below the table. |
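Algorithm 1 quoted above is the classic greedy farthest-first traversal (Gonzalez's algorithm, here called GMM). For reference, below is a minimal NumPy sketch of that procedure; it is not taken from the authors' repository, and the function name `gmm`, the Euclidean metric, and the random seeding are illustrative assumptions.

```python
import numpy as np

def gmm(points: np.ndarray, k: int, seed: int = 0) -> tuple[list[int], list[float]]:
    """Greedy farthest-first traversal (Algorithm 1 sketch, not the authors' code).
    Returns the indices of the k selected points and the radii r_i, where r_i is
    the minimum pairwise distance within S after the i-th insertion."""
    rng = np.random.default_rng(seed)
    n = len(points)
    selected = [int(rng.integers(n))]   # S <- an arbitrary point from P
    radii = [float("inf")]              # r_1 <- infinity (a singleton has no pairs)
    # dist[j] = distance from point j to the current set S
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for _ in range(1, k):
        far = int(np.argmax(dist))      # p_i <- farthest point in P from S
        selected.append(far)
        radii.append(float(dist[far]))  # equals the min pairwise distance in new S
        # fold the new point into S by updating point-to-set distances
        dist = np.minimum(dist, np.linalg.norm(points - points[far], axis=1))
    return selected, radii
```

A standard property of farthest-first traversal is that the insertion distances are non-increasing, so the distance of each newly added point to S equals the minimum pairwise distance of the updated S; this is why step 5 of Algorithm 1 reduces to recording `dist[far]` in the sketch.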
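The color-assignment rule quoted in the Experiment Setup row maps each timestamp to one of m equidistant time windows. Below is a minimal sketch of that bucketing, assuming the extracted formula originally carried a ceiling (the rounding brackets appear to have been lost in extraction) and that the boundary point t_min is folded into color 1; the helper name `assign_colors` and the example budgets are illustrative, not from the paper.

```python
import numpy as np

def assign_colors(timestamps: np.ndarray, m: int) -> np.ndarray:
    """Map creation timestamps to colors 1..m so that messages close in time
    share a color and the most recent messages get color m. Hypothetical helper;
    the ceiling and the t_min boundary handling are assumptions."""
    t_min, t_max = timestamps.min(), timestamps.max()
    assert t_max > t_min, "need at least two distinct timestamps"
    colors = np.ceil(m * (timestamps - t_min) / (t_max - t_min)).astype(int)
    return np.clip(colors, 1, m)  # t_min alone would map to 0; fold it into color 1

# Example: m = 4 weekly windows with per-color budgets k_i, as in the quoted setup.
fair_budgets = [5, 5, 5, 5]      # by fairness: all k_i equal
recency_budgets = [2, 4, 6, 8]   # by recency: k_i increases with i
```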