Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Multi-Mention Learning for Reading Comprehension with Neural Cascades
Authors: Swabha Swayamdipta, Ankur P. Parikh, Tom Kwiatkowski
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our approach achieves state-of-the-art performance on both the Wikipedia and web domains of the TriviaQA dataset, outperforming more complex, recurrent architectures. Our experimental results (§4) show that all the above are essential in helping our model achieve state-of-the-art performance. |
| Researcher Affiliation | Collaboration | Swabha Swayamdipta Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213, USA EMAIL Ankur P. Parikh & Tom Kwiatkowski Google Research New York, NY 10011, USA EMAIL |
| Pseudocode | No | The paper describes the model architecture and components using equations and textual descriptions, but it does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a direct link to a code repository for the methodology described. It mentions using open-source tools like TensorFlow and NLTK, but not its own implementation. |
| Open Datasets | Yes | The TriviaQA dataset (Joshi et al., 2017) contains a collection of 95k trivia question-answer pairs from several online trivia sources. We use GloVe embeddings (Pennington et al., 2014) of dimension 300 (trained on a corpus of 840 billion words) that are not updated during training. |
| Dataset Splits | Yes | Table 2 shows some ablations that give more insight into the different contributions of our model components. Our final approach (3-Level Cascade, Multiloss) achieves the best performance... on the full Wikipedia development set. Figure 3 (left) shows the behavior of the k-best predictions of different models on the human-validated Wikipedia development set. |
| Hardware Specification | Yes | Each hyperparameter setting took 2-3 days to train on a single NVIDIA P100 GPU. ... (both approaches use a P100 GPU). |
| Software Dependencies | No | The paper mentions software such as NLTK ('All documents are tokenized using the NLTK tokenizer.', with footnote 4 pointing to http://www.nltk.org) and TensorFlow ('The model was implemented in TensorFlow (Abadi et al., 2016)'). However, it does not provide specific version numbers for these software components, which is required for a 'Yes' classification. |
| Experiment Setup | Yes | We additionally tuned the following hyperparameters using grid search and indicate the optimal values in parentheses: network size (2-layers, each with 300 neurons), dropout ratio (0.1), learning rate (0.05), context size (1), and loss weights (λ1 = λ2 = 0.35, λ3 = 0.2, λ4 = 0.1). We use Adagrad (Duchi et al., 2011) for optimization (default initial accumulator value set to 0.1, batch size set to 1). |
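
For readers attempting a re-run, the values quoted in the Experiment Setup row can be collected into a single configuration object. The sketch below is illustrative only: the class name `ReportedSetup`, its field names, and the helper `weighted_multiloss` are hypothetical and do not come from the paper. It also assumes that the four loss weights combine per-component losses as a plain weighted sum; the table does not say which model component each weight belongs to.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; all field names are hypothetical."""
    num_layers: int = 2                 # "2-layers"
    hidden_units: int = 300             # "each with 300 neurons"
    dropout: float = 0.1
    learning_rate: float = 0.05
    context_size: int = 1
    # lambda_1 .. lambda_4, in the order listed in the Experiment Setup row
    loss_weights: Tuple[float, ...] = (0.35, 0.35, 0.2, 0.1)
    adagrad_initial_accumulator: float = 0.1
    batch_size: int = 1
    glove_dim: int = 300                # frozen GloVe embeddings (Open Datasets row)


def weighted_multiloss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """ASSUMPTION: the total loss is a weighted sum of per-component losses."""
    if len(losses) != len(weights):
        raise ValueError("need exactly one loss value per weight")
    return sum(w * l for w, l in zip(weights, losses))


setup = ReportedSetup()
# Made-up per-component loss values, purely to exercise the function.
total = weighted_multiloss([1.2, 0.9, 1.5, 2.0], setup.loss_weights)
```

The reported weights sum to 1, so under this assumption the combined loss is a convex combination of the component losses. Library versions are not captured here because the paper does not report them.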