Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Multi-Mention Learning for Reading Comprehension with Neural Cascades

Authors: Swabha Swayamdipta, Ankur P. Parikh, Tom Kwiatkowski

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our approach achieves state-of-the-art performance on both the Wikipedia and web domains of the TriviaQA dataset, outperforming more complex, recurrent architectures. Our experimental results (§4) show that all the above are essential in helping our model achieve state-of-the-art performance.
Researcher Affiliation | Collaboration | Swabha Swayamdipta, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA, EMAIL; Ankur P. Parikh & Tom Kwiatkowski, Google Research, New York, NY 10011, USA, EMAIL
Pseudocode | No | The paper describes the model architecture and components using equations and textual descriptions, but it does not include a clearly labeled pseudocode block or algorithm.
Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a direct link to a code repository for the methodology described. It mentions using open-source tools like TensorFlow and NLTK, but not its own implementation.
Open Datasets | Yes | The TriviaQA dataset (Joshi et al., 2017) contains a collection of 95k trivia question-answer pairs from several online trivia sources. We use GloVe embeddings (Pennington et al., 2014) of dimension 300 (trained on a corpus of 840 billion words) that are not updated during training.
Dataset Splits | Yes | Table 2 shows some ablations that give more insight into the different contributions of our model components. Our final approach (3-Level Cascade, Multiloss) achieves the best performance... on the full Wikipedia development set. Figure 3 (left) shows the behavior of the k-best predictions of different models on the human-validated Wikipedia development set.
Hardware Specification | Yes | Each hyperparameter setting took 2-3 days to train on a single NVIDIA P100 GPU. ... (both approaches use a P100 GPU).
Software Dependencies | No | The paper mentions software like NLTK ('All documents are tokenized using the NLTK tokenizer.', with footnote 4 pointing to http://www.nltk.org) and TensorFlow ('The model was implemented in TensorFlow (Abadi et al., 2016)'). However, it does not provide specific version numbers for these software components, which is required for a 'Yes' classification.
Experiment Setup | Yes | We additionally tuned the following hyperparameters using grid search and indicate the optimal values in parentheses: network size (2-layers, each with 300 neurons), dropout ratio (0.1), learning rate (0.05), context size (1), and loss weights (λ1 = λ2 = 0.35, λ3 = 0.2, λ4 = 0.1). We use Adagrad (Duchi et al., 2011) for optimization (default initial accumulator value set to 0.1, batch size set to 1).
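
The values quoted in the Experiment Setup row can be collected into a small configuration object, which may help anyone attempting a reimplementation since no source code was released. The following is a minimal sketch in plain Python, not the authors' code. The names `ExperimentSetup`, `combined_loss`, and `adagrad_step` are hypothetical, and combining the four losses as a weighted sum is an assumption: this summary does not say which model component each λ applies to. The update rule is the standard Adagrad rule without an epsilon term.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ExperimentSetup:
    """Hyperparameter values as quoted in the Experiment Setup row (names are hypothetical)."""
    hidden_layers: int = 2
    hidden_units: int = 300
    dropout: float = 0.1
    learning_rate: float = 0.05
    context_size: int = 1
    # (lambda1, lambda2, lambda3, lambda4) as reported.
    loss_weights: Tuple[float, ...] = (0.35, 0.35, 0.2, 0.1)
    adagrad_initial_accumulator: float = 0.1
    batch_size: int = 1
    embedding_dim: int = 300  # GloVe vectors, kept fixed during training


def combined_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of per-component losses (assumed form of the multi-loss objective)."""
    if len(losses) != len(weights):
        raise ValueError("need exactly one weight per loss term")
    return sum(w * l for w, l in zip(weights, losses))


def adagrad_step(param: float, grad: float, accumulator: float,
                 learning_rate: float) -> Tuple[float, float]:
    """One scalar Adagrad update: accumulate the squared gradient, then scale the step."""
    accumulator = accumulator + grad * grad
    param = param - learning_rate * grad / math.sqrt(accumulator)
    return param, accumulator


setup = ExperimentSetup()
# Placeholder loss values, chosen only to illustrate the call.
total = combined_loss([1.2, 0.8, 0.5, 0.3], setup.loss_weights)
# A single update starting from the reported initial accumulator value.
new_param, new_acc = adagrad_step(
    param=1.0, grad=2.0,
    accumulator=setup.adagrad_initial_accumulator,
    learning_rate=setup.learning_rate,
)
```

The reported loss weights sum to 1, so under the weighted-sum assumption the combined objective is a convex combination of the individual losses. With the accumulator starting at 0.1 rather than 0, the first update is slightly smaller than the learning rate alone would suggest.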