Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
An Investigation of Offline Reinforcement Learning in Factorisable Action Spaces
Authors: Alex Beeson, David Ireland, Giovanni Montana
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we undertake a formative investigation into offline reinforcement learning in factorisable action spaces. Using value-decomposition as formulated in Dec QN as a foundation, we present the case for a factorised approach and conduct an extensive empirical evaluation of several offline techniques adapted to the factorised setting. In the absence of established benchmarks, we introduce a suite of our own comprising datasets of varying quality and task complexity. |
| Researcher Affiliation | Academia | 1Warwick Manufacturing Group, University of Warwick, Coventry, UK 2Warwick Medical School, University of Warwick, Coventry, UK 3Department of Statistics, University of Warwick, Coventry, UK 4Alan Turing Institute, London, UK |
| Pseudocode | Yes | Algorithm 1 DecQN-BCQ. Require: threshold τ, discount factor γ, target network update rate µ, number of sub-action spaces N, and dataset B. Initialise utility function parameters θ = {θ_i}_{i=1}^N, corresponding target parameters θ̂ = θ, and policy parameters ϕ = {ϕ_i}_{i=1}^N. For t = 0 to T: sample a minibatch of transitions (s, a, r, s′) from B; update ϕ ← argmin_ϕ −(1/N) Σ_{i=1}^N Σ_{s,a_i} log π^i_{ϕ_i}(a_i \| s); update θ ← argmin_θ Σ_{s,a,r,s′} (Q_θ(s, a) − y)², where Q_θ(s, a) = (1/N) Σ_{i=1}^N U^i_{θ_i}(s, a_i), y = r + (γ/N) Σ_{i=1}^N max_{a′_i : ρ^i(a′_i) ≥ τ} U^i_{θ̂_i}(s′, a′_i), and ρ^i(a′_i) = π^i_{ϕ_i}(a′_i \| s′) / max_{â_i} π^i_{ϕ_i}(â_i \| s′); update θ̂ ← µθ + (1 − µ)θ̂. End for. |
| Open Source Code | Yes | In the spirit of advancing research in this area, we provide open access to these datasets as well as our full code base: https://github.com/AlexBeesonWarwick/OfflineRLFactorisableActionSpaces. |
| Open Datasets | Yes | In the spirit of advancing research in this area, we provide open access to these datasets as well as our full code base: https://github.com/AlexBeesonWarwick/OfflineRLFactorisableActionSpaces. |
| Dataset Splits | No | The paper describes how datasets are *composed* from different policies (expert, medium, random) and combined (e.g., '45% random and medium transitions and 10% expert'). However, it does not specify how these created datasets are then *split* into training, validation, and test sets for the models being evaluated, or reference any standard dataset splits for model training and evaluation in the context of reproducibility. |
| Hardware Specification | No | The paper mentions 'GPU usage' in tables (e.g., Table 5 and 6) as a metric for computation, but it does not specify any particular GPU models, CPU models, memory sizes, or other specific hardware components used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'Adam optimiser (Kingma & Ba, 2014)' but does not provide a specific version number for it or for any other key software libraries, frameworks, or languages (e.g., Python, PyTorch) used in the implementation. |
| Experiment Setup | Yes | Optimizer: Adam; Learning rate: 1×10⁻⁴; Replay size: 5×10⁵; n-step returns: 3; Discount γ: 0.99; Batch size: 256; Hidden size: 512; Gradient clipping: 40; Target network update parameter c: 0.005; Importance sampling exponent: 0.2; Priority exponent: 0.6; Minimum exploration ϵ: 0.05; ϵ decay rate: 0.99995; Regularisation loss coefficient β: 0.5; Ensemble size K: 10 |
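To make the quoted DecQN-BCQ pseudocode concrete, the sketch below computes the bootstrap target y = r + (γ/N) Σ_i max over permitted sub-actions of the target utilities, with the BCQ-style filter ρ^i(a′_i) ≥ τ. This is a minimal illustrative reconstruction, not the authors' implementation: the function name `decqn_bcq_target` and the use of plain NumPy arrays (one utility vector and one behaviour-policy probability vector per sub-action space) are assumptions made for clarity.

```python
import numpy as np

def decqn_bcq_target(reward, gamma, tau, target_utils, policy_probs):
    """Sketch of the DecQN-BCQ target y = r + (gamma/N) * sum_i max_{a'_i : rho_i >= tau} U_i(s', a'_i).

    target_utils: list of N arrays, each of shape (|A_i|,) -- sub-action utilities from target nets
    policy_probs: list of N arrays, each of shape (|A_i|,) -- behaviour-cloned pi_i(. | s')
    """
    n = len(target_utils)
    total = 0.0
    for u, pi in zip(target_utils, policy_probs):
        rho = pi / pi.max()       # relative behaviour probability rho_i(a'_i)
        allowed = rho >= tau      # BCQ filter: keep sufficiently likely sub-actions
        total += u[allowed].max() # max target utility over the permitted sub-actions
    return reward + gamma * total / n
```

With τ = 0.5, a sub-action whose behaviour probability falls below half of the most likely sub-action's is excluded from the max, which is what keeps the bootstrap target close to the dataset's action support in the offline setting.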