Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

An Investigation of Offline Reinforcement Learning in Factorisable Action Spaces

Authors: Alex Beeson, David Ireland, Giovanni Montana

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this work, we undertake a formative investigation into offline reinforcement learning in factorisable action spaces. Using value-decomposition as formulated in DecQN as a foundation, we present the case for a factorised approach and conduct an extensive empirical evaluation of several offline techniques adapted to the factorised setting. In the absence of established benchmarks, we introduce a suite of our own comprising datasets of varying quality and task complexity."
Researcher Affiliation | Academia | "1 Warwick Manufacturing Group, University of Warwick, Coventry, UK; 2 Warwick Medical School, University of Warwick, Coventry, UK; 3 Department of Statistics, University of Warwick, Coventry, UK; 4 Alan Turing Institute, London, UK"
Pseudocode | Yes | Algorithm 1 DecQN-BCQ
Require: threshold τ, discount factor γ, target network update rate µ, number of sub-action spaces N, and dataset B
Initialise utility function parameters θ = {θ_i}_{i=1}^N, corresponding target parameters θ̂ = θ, and policy parameters ϕ = {ϕ_i}_{i=1}^N
for t = 0 to T do
    Sample minibatch of transitions (s, a, r, s′) from B
    ϕ ← argmin_ϕ (1/N) Σ_{i=1}^N Σ_{s,a_i} −log π^i_{ϕ_i}(a_i | s)
    θ ← argmin_θ Σ_{s,a,r,s′} (Q_θ(s, a) − y)², where
        Q_θ(s, a) = (1/N) Σ_{i=1}^N U^i_{θ_i}(s, a_i),
        y = r + (γ/N) Σ_{i=1}^N max_{a′_i : ρ^i(a′_i) ≥ τ} U^i_{θ̂_i}(s′, a′_i),
        ρ^i(a′_i) = π^i_{ϕ_i}(a′_i | s′) / max_{â_i} π^i_{ϕ_i}(â_i | s′)
    θ̂ ← µθ + (1 − µ)θ̂
end for
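The BCQ-style target in Algorithm 1 can be illustrated with a minimal sketch: for each of the N sub-action spaces, sub-actions are filtered by their relative likelihood ρ under the behaviour-cloned policy, and only eligible sub-actions compete in the max. The function name and NumPy-based interface below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def decqn_bcq_target(reward, gamma, tau, utilities, policy_probs):
    """Sketch of the BCQ-filtered target for factorisable action spaces.

    utilities:    list of N arrays, target utility values U^i(s', .)
    policy_probs: list of N arrays, behaviour-cloned probs pi^i(. | s')
    Only sub-actions with relative probability rho >= tau are eligible
    for the max, mirroring the constraint in Algorithm 1.
    """
    n = len(utilities)
    total = 0.0
    for u, p in zip(utilities, policy_probs):
        rho = p / p.max()             # relative likelihood under pi^i
        eligible = u[rho >= tau]      # drop sub-actions the data rarely takes
        total += eligible.max()       # constrained max per sub-action space
    return reward + gamma * total / n
```

With τ = 0 the filter is inactive and the target reduces to the plain DecQN target (mean of per-space maxima); raising τ restricts the max to sub-actions well supported by the dataset.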
Open Source Code | Yes | "In the spirit of advancing research in this area, we provide open access to these datasets as well as our full code base: https://github.com/AlexBeesonWarwick/OfflineRLFactorisableActionSpaces."
Open Datasets | Yes | "In the spirit of advancing research in this area, we provide open access to these datasets as well as our full code base: https://github.com/AlexBeesonWarwick/OfflineRLFactorisableActionSpaces."
Dataset Splits | No | The paper describes how datasets are *composed* from different policies (expert, medium, random) and combined (e.g., "45% random and medium transitions and 10% expert"). However, it does not specify how these composed datasets are then *split* into training, validation, and test sets for the models being evaluated, nor does it reference any standard splits for training and evaluation.
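The dataset composition quoted above (mixing transitions from policies of different quality) can be sketched as follows. This is an illustrative reading, not the authors' code: the function name, interface, and the interpretation of the quoted fractions as 45% random / 45% medium / 10% expert are assumptions.

```python
import random

def compose_mixed_dataset(random_pool, medium_pool, expert_pool,
                          fractions=(0.45, 0.45, 0.10), size=1000, seed=0):
    """Illustrative sketch: build a mixed offline dataset by sampling
    transitions from policy-specific pools in fixed proportions."""
    rng = random.Random(seed)
    mixed = []
    for pool, frac in zip((random_pool, medium_pool, expert_pool), fractions):
        # sample with replacement so pool size does not constrain the mix
        mixed.extend(rng.choices(pool, k=int(size * frac)))
    rng.shuffle(mixed)
    return mixed
```

Note that this only composes a single offline dataset; as the row above observes, a separate train/validation/test split of the result is not specified in the paper.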
Hardware Specification | No | The paper mentions "GPU usage" in tables (e.g., Tables 5 and 6) as a metric for computation, but it does not specify any particular GPU models, CPU models, memory sizes, or other specific hardware components used for running the experiments.
Software Dependencies | No | The paper mentions using the "Adam optimiser (Kingma & Ba, 2014)" but does not provide a specific version number for it or for any other key software libraries, frameworks (such as Python or PyTorch), or environments used in the implementation.
Experiment Setup | Yes | Hyperparameter table:
    Optimizer: Adam
    Learning rate: 1 × 10⁻⁴
    Replay size: 5 × 10⁵
    n-step returns: 3
    Discount, γ: 0.99
    Batch size: 256
    Hidden size: 512
    Gradient clipping: 40
    Target network update parameter, c: 0.005
    Importance sampling exponent: 0.2
    Priority exponent: 0.6
    Minimum exploration, ε: 0.05
    ε decay rate: 0.99995
    Regularisation loss coefficient, β: 0.5
    Ensemble size, K: 10
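For reference, the reported hyperparameters can be collected into a single config mapping. The dictionary keys below are illustrative names chosen here, not the authors' actual configuration keys; only the values come from the paper.

```python
# Hypothetical config dict mirroring the reported hyperparameters.
HYPERPARAMS = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "replay_size": 500_000,
    "n_step_returns": 3,
    "discount_gamma": 0.99,
    "batch_size": 256,
    "hidden_size": 512,
    "gradient_clipping": 40,
    "target_update_c": 0.005,
    "importance_sampling_exponent": 0.2,
    "priority_exponent": 0.6,
    "min_exploration_eps": 0.05,
    "eps_decay_rate": 0.99995,
    "regularisation_beta": 0.5,
    "ensemble_size_K": 10,
}
```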