Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

An Investigation of Offline Reinforcement Learning in Factorisable Action Spaces

Authors: Alex Beeson, David Ireland, Giovanni Montana

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this work, we undertake a formative investigation into offline reinforcement learning in factorisable action spaces. Using value-decomposition as formulated in DecQN as a foundation, we present the case for a factorised approach and conduct an extensive empirical evaluation of several offline techniques adapted to the factorised setting. In the absence of established benchmarks, we introduce a suite of our own comprising datasets of varying quality and task complexity."
Researcher Affiliation | Academia | "1 Warwick Manufacturing Group, University of Warwick, Coventry, UK; 2 Warwick Medical School, University of Warwick, Coventry, UK; 3 Department of Statistics, University of Warwick, Coventry, UK; 4 Alan Turing Institute, London, UK"
Pseudocode | Yes | Algorithm 1 DecQN-BCQ
Require: threshold τ, discount factor γ, target network update rate µ, number of sub-action spaces N, and dataset B
Initialise utility function parameters θ = {θ_i}_{i=1}^N, corresponding target parameters θ̂ = θ, and policy parameters ϕ = {ϕ_i}_{i=1}^N
for t = 0 to T do
    Sample minibatch of transitions (s, a, r, s′) from B
    ϕ ← argmin_ϕ (1/N) Σ_{i=1}^N Σ_{s,a_i} −log π^i_{ϕ_i}(a_i | s)
    θ ← argmin_θ Σ_{s,a,r,s′} (Q_θ(s, a) − y)², where
        Q_θ(s, a) = (1/N) Σ_{i=1}^N U^i_{θ_i}(s, a_i),
        y = r + (γ/N) Σ_{i=1}^N max_{a′_i : ρ^i(a′_i) ≥ τ} U^i_{θ̂_i}(s′, a′_i),
        ρ^i(a′_i) = π^i_{ϕ_i}(a′_i | s′) / max_{â_i} π^i_{ϕ_i}(â_i | s′)
    θ̂ ← µθ + (1 − µ)θ̂
end for
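The BCQ-style target in Algorithm 1 can be illustrated with a minimal sketch: for each of the N sub-action spaces, sub-actions are filtered by their relative likelihood ρ under the behaviour-cloned policy, and only eligible sub-actions compete in the max. The function name and NumPy-based interface below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def decqn_bcq_target(reward, gamma, tau, utilities, policy_probs):
    """Sketch of the BCQ-filtered target for factorisable action spaces.

    utilities:    list of N arrays, target utility values U^i(s', .)
    policy_probs: list of N arrays, behaviour-cloned probs pi^i(. | s')
    Only sub-actions with relative probability rho >= tau are eligible
    for the max, mirroring the constraint in Algorithm 1.
    """
    n = len(utilities)
    total = 0.0
    for u, p in zip(utilities, policy_probs):
        rho = p / p.max()             # relative likelihood under pi^i
        eligible = u[rho >= tau]      # drop sub-actions the data rarely takes
        total += eligible.max()       # constrained max per sub-action space
    return reward + gamma * total / n
```

With τ = 0 the filter is inactive and the target reduces to the plain DecQN target (mean of per-space maxima); raising τ restricts the max to sub-actions well supported by the dataset.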
Open Source Code | Yes | "In the spirit of advancing research in this area, we provide open access to these datasets as well as our full code base: https://github.com/AlexBeesonWarwick/OfflineRLFactorisableActionSpaces."
Open Datasets | Yes | "In the spirit of advancing research in this area, we provide open access to these datasets as well as our full code base: https://github.com/AlexBeesonWarwick/OfflineRLFactorisableActionSpaces."
Dataset Splits | No | The paper describes how datasets are *composed* from different policies (expert, medium, random) and combined (e.g., "45% random and medium transitions and 10% expert"). However, it does not specify how these composed datasets are then *split* into training, validation, and test sets for the models being evaluated, nor does it reference any standard splits for training and evaluation.
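The dataset composition quoted above (mixing transitions from policies of different quality) can be sketched as follows. This is an illustrative reading, not the authors' code: the function name, interface, and the interpretation of the quoted fractions as 45% random / 45% medium / 10% expert are assumptions.

```python
import random

def compose_mixed_dataset(random_pool, medium_pool, expert_pool,
                          fractions=(0.45, 0.45, 0.10), size=1000, seed=0):
    """Illustrative sketch: build a mixed offline dataset by sampling
    transitions from policy-specific pools in fixed proportions."""
    rng = random.Random(seed)
    mixed = []
    for pool, frac in zip((random_pool, medium_pool, expert_pool), fractions):
        # sample with replacement so pool size does not constrain the mix
        mixed.extend(rng.choices(pool, k=int(size * frac)))
    rng.shuffle(mixed)
    return mixed
```

Note that this only composes a single offline dataset; as the row above observes, a separate train/validation/test split of the result is not specified in the paper.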
Hardware Specification | No | The paper mentions "GPU usage" in tables (e.g., Tables 5 and 6) as a metric for computation, but it does not specify any particular GPU models, CPU models, memory sizes, or other specific hardware components used for running the experiments.
Software Dependencies | No | The paper mentions using the "Adam optimiser (Kingma & Ba, 2014)" but does not provide a specific version number for it or for any other key software libraries, frameworks (such as Python or PyTorch), or environments used in the implementation.
Experiment Setup | Yes | Hyperparameter table:
    Optimizer: Adam
    Learning rate: 1 × 10⁻⁴
    Replay size: 5 × 10⁵
    n-step returns: 3
    Discount, γ: 0.99
    Batch size: 256
    Hidden size: 512
    Gradient clipping: 40
    Target network update parameter, c: 0.005
    Importance sampling exponent: 0.2
    Priority exponent: 0.6
    Minimum exploration, ε: 0.05
    ε decay rate: 0.99995
    Regularisation loss coefficient, β: 0.5
    Ensemble size, K: 10
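For reference, the reported hyperparameters can be collected into a single config mapping. The dictionary keys below are illustrative names chosen here, not the authors' actual configuration keys; only the values come from the paper.

```python
# Hypothetical config dict mirroring the reported hyperparameters.
HYPERPARAMS = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "replay_size": 500_000,
    "n_step_returns": 3,
    "discount_gamma": 0.99,
    "batch_size": 256,
    "hidden_size": 512,
    "gradient_clipping": 40,
    "target_update_c": 0.005,
    "importance_sampling_exponent": 0.2,
    "priority_exponent": 0.6,
    "min_exploration_eps": 0.05,
    "eps_decay_rate": 0.99995,
    "regularisation_beta": 0.5,
    "ensemble_size_K": 10,
}
```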