Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
MeanSum: A Neural Model for Unsupervised Multi-Document Abstractive Summarization
Authors: Eric Chu, Peter Liu
ICML 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show through automated metrics and human evaluation that the generated summaries are highly abstractive, fluent, relevant, and representative of the average sentiment of the input reviews. Finally, we collect a reference evaluation dataset and show that our model outperforms a strong extractive baseline. |
| Researcher Affiliation | Collaboration | ¹MIT Media Lab, ²Google Brain. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available online.³ https://github.com/sosuperic/MeanSum |
| Open Datasets | Yes | We tuned our models primarily on a dataset of customer reviews provided in the Yelp Dataset Challenge, where each review is accompanied by a 5-star rating. https://www.yelp.com/dataset/challenge |
| Dataset Splits | Yes | The final training, validation, and test splits consist of 10695, 1337, and 1337 businesses, and 1038184, 129856, and 129840 reviews, respectively. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only describes software architecture and training parameters. |
| Software Dependencies | No | The paper mentions general algorithms and models like "multiplicative LSTM" and "Adam" but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The language model, encoders, and decoders were multiplicative LSTMs (Krause et al., 2016) with 512 hidden units, a 0.1 dropout rate, a word embedding size of 256, and layer normalization (Ba et al., 2016). We used Adam (Kingma & Ba, 2014) to train, a learning rate of 0.001 for the language model, a learning rate of 0.0001 for the classifier, and a learning rate of 0.0005 for the summarization model, with β1 = 0.9 and β2 = 0.999. The initial temperature for the Gumbel-softmax was set to 2.0. One input item to the language model was k = 8 reviews from the same business or product concatenated together with end-of-review delimiters, with each update step operating on a subsequence of 256 subtokens. The review-rating classifier was a multi-channel text convolutional neural network similar to Kim (2014) with 3,4,5 width filters, 128 feature maps per filter, and a 0.5 dropout rate. |
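
The values quoted in the Dataset Splits and Experiment Setup rows can be gathered into a single configuration object, which may help when attempting a reproduction. The following is a minimal sketch using only the Python standard library. The class name, field names, and helper function are hypothetical and are not taken from the authors' repository; only the numeric values come from the table above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class MeanSumSetup:
    """Hyperparameters quoted in the Experiment Setup row.

    Field names are illustrative assumptions, not the authors' identifiers.
    """
    # Multiplicative LSTM language model, encoders, and decoders
    hidden_size: int = 512
    dropout: float = 0.1
    embedding_size: int = 256
    layer_norm: bool = True
    # Adam learning rates, one per component
    lr_language_model: float = 1e-3
    lr_classifier: float = 1e-4
    lr_summarizer: float = 5e-4
    adam_betas: Tuple[float, float] = (0.9, 0.999)
    # Gumbel-softmax initial temperature
    gumbel_initial_temperature: float = 2.0
    # Language model input: k reviews per item, 256-subtoken subsequences
    reviews_per_item: int = 8
    subsequence_length: int = 256
    # Review-rating classifier (text CNN similar to Kim, 2014)
    cnn_filter_widths: Tuple[int, ...] = (3, 4, 5)
    cnn_feature_maps: int = 128
    cnn_dropout: float = 0.5


# Business counts quoted in the Dataset Splits row.
SPLIT_BUSINESSES: Dict[str, int] = {"train": 10695, "val": 1337, "test": 1337}


def split_fractions(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    setup = MeanSumSetup()
    print(setup)
    for name, frac in split_fractions(SPLIT_BUSINESSES).items():
        print(f"{name}: {frac:.4f}")
```

Applied to the business counts, the helper shows that the reported splits correspond to roughly an 80/10/10 partition by business.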