Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Batch Reinforcement Learning Through Continuation Method
Authors: Yijie Guo, Shengyu Feng, Nicolas Le Roux, Ed Chi, Honglak Lee, Minmin Chen
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present results on a variety of control tasks, game environments and a recommendation task to empirically demonstrate the efficacy of our proposed method. ... 4 EXPERIMENTS We evaluate our method with several baselines on continuous control tasks. |
| Researcher Affiliation | Collaboration | 1University of Michigan 2Google AI |
| Pseudocode | Yes | Algorithm 1 Soft Policy Iteration through Continuation Method |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We use a publicly available dataset MovieLens-1M, a popular benchmark for recommender system. ... We focus on eight games and generate the datasets as discussed in Fujimoto et al. [13]. |
| Dataset Splits | No | The paper mentions using a 'training dataset D' and a 'held-out test set' for evaluation but does not specify explicit dataset splits (e.g., percentages or counts) for train, validation, or test sets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | We set [the temperature] to large value initially and let the KL divergence term dominate the objective, thus performing behavior cloning. We record a moving average of the Q value estimation variance var(Q , 0) over 1000 updates at the end of the phase. After that, we decay the temperature gradually with λ = 0.9 every I steps. |
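
The schedule quoted in the Experiment Setup row can be sketched roughly as follows. This is an illustrative reading of the excerpt, not the authors' implementation: the decay factor 0.9 and the 1000-update averaging window come from the quoted text, while `tau_init`, `warmup_steps`, `interval` (standing in for the unspecified I), and both helper names are hypothetical. The sketch also assumes the first decay is applied `interval` steps after the warm-up phase ends, which the excerpt does not state.

```python
from collections import deque


def temperature_at(step, tau_init, warmup_steps, interval, decay=0.9):
    """Temperature at a given update step (hypothetical helper).

    The temperature stays at tau_init during the warm-up (behavior-cloning)
    phase, then is multiplied by `decay` once per `interval` steps.
    """
    if step < warmup_steps:
        return tau_init
    num_decays = (step - warmup_steps) // interval
    return tau_init * decay ** num_decays


class MovingAverage:
    """Mean of the most recent `window` values (1000 in the quoted setup)."""

    def __init__(self, window=1000):
        # deque with maxlen drops the oldest value automatically
        self.values = deque(maxlen=window)

    def update(self, value):
        self.values.append(float(value))
        return sum(self.values) / len(self.values)


if __name__ == "__main__":
    # Assumed values for illustration only; the excerpt gives none of these.
    for step in (0, 499, 500, 600, 750):
        tau = temperature_at(step, tau_init=10.0, warmup_steps=500, interval=100)
        print(step, tau)
```

In a training loop, one would presumably feed the per-update variance of the Q-value estimates into `MovingAverage.update` near the end of the warm-up phase, and look up the current temperature with `temperature_at` before each policy update.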