Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learning to Reuse Policies in State Evolvable Environments
Authors: Ziqian Zhang, Bohan Yang, Lihe Li, Yuqi Bian, Ruiqi Xue, Feng Chen, Yi-Chen Li, Lei Yuan, Yang Yu
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present our experimental analysis conducted across eight diverse tasks, including continuous control tasks from MuJoCo, featuring vectorial state features, and Atari games with pixel-based image inputs and discrete action spaces. Our experiments are designed to address three critical questions: (1) Whether Lapse achieves superior adaptation capabilities in state evolvable environments compared to existing methods (Section 4.2)? (2) How does the learning process of Lapse proceed in evolving stages, assessing its adaptability (Section 4.3)? (3) What contributions do the different components and hyper-parameters of Lapse make to its overall performance (Section 4.4)? For a comprehensive evaluation, Lapse is compared against multiple baselines. All results are averaged over five random seeds and are presented with their corresponding standard deviations. |
| Researcher Affiliation | Collaboration | 1National Key Laboratory for Novel Software Technology, Nanjing University 2School of Artificial Intelligence, Nanjing University 3Polixir Technologies. Correspondence to: Lei Yuan <EMAIL>, Yang Yu <EMAIL>. |
| Pseudocode | Yes | Detailed pseudocode is provided in Appendix C. [...] Algorithm 1 recon_train(n, D_n, π_n) [...] Algorithm 2 off_train(n, D_n, π_n, β_{n+1}) [...] Algorithm 3 Lapse |
| Open Source Code | Yes | Our code is available at https://github.com/zzq-bot/Lapse |
| Open Datasets | Yes | Extensive experiments are performed both in MuJoCo control tasks (Todorov et al., 2012) with vectorial state features and Atari games (Bellemare et al., 2013) with pixel-based images. [...] For evaluation, we select four tasks from the Gym MuJoCo suite (Todorov et al., 2012): Ant, HalfCheetah, Hopper, and Walker. [...] In the context of Atari games (Bellemare et al., 2013), which involve pixel-based image inputs and discrete action spaces, we employ DQN (Mnih et al., 2013) as the backbone to train the robust RADIAL agent (Oikarinen et al., 2021) π0 for games including Bank Heist, Freeway, Pong, and Road Runner, following the setup outlined in Oikarinen et al. (2021). |
| Dataset Splits | No | The paper states: 'We restrict the number of trajectories in the dataset Dn to 10 and 15 in Mujoco and Atari, respectively.' This refers to the amount of experience collected for offline learning during adaptation, not a traditional train/validation/test split of a static dataset. No specific percentages, sample counts, or predefined splits for the main environments are provided. |
| Hardware Specification | No | No specific hardware details (such as GPU models, CPU models, or memory specifications) used for running the experiments are explicitly mentioned in the paper. The paper discusses environments and software, but not the computational infrastructure. |
| Software Dependencies | No | The paper mentions various algorithms and models used (e.g., PPO, Wocar, DQN, RADIAL, TD3+BC, CQL, GANs) with their respective citations, but it does not specify version numbers for programming languages, libraries, or frameworks (e.g., Python version, PyTorch/TensorFlow version, CUDA version). |
| Experiment Setup | Yes | In Appendix C, Section C.5, titled 'The Hyperparameter Choice of Lapse', provides a detailed Table 3 listing specific hyperparameter values for different environments. These include 'p value in L_p', 'λ in L_recon', 'β_max in L̂_off', 'τ in L̂_off', 'α_robust in L̂_off', 'ϵ in B_ϵ', 'policy update interval', 'T^recon_max', 'T^off_max', 'learning rate', 'target update τ_TD3', 'γ', 'ema coefficient', 'ema interval', 'number of quantiles in QRDQN', and 'target update interval' with their corresponding values. |
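The evaluation protocol quoted above reports each result as a mean with its standard deviation over five random seeds. A minimal sketch of that aggregation step is shown below; the function name and the example return values are illustrative assumptions, not taken from the paper's released code.

```python
import statistics


def aggregate_seed_returns(returns_by_seed):
    """Aggregate per-seed evaluation returns into (mean, std).

    Mirrors the reporting convention in the paper (mean +/- standard
    deviation over five random seeds). Uses the sample standard
    deviation; a single seed yields a std of 0.0 by convention here.
    """
    mean = statistics.mean(returns_by_seed)
    std = statistics.stdev(returns_by_seed) if len(returns_by_seed) > 1 else 0.0
    return mean, std


# Hypothetical returns from five seeds on one task.
mean, std = aggregate_seed_returns([102.0, 98.0, 101.0, 99.0, 100.0])
print(f"{mean:.1f} \u00b1 {std:.1f}")  # prints "100.0 ± 1.6"
```

Whether the authors use the sample or population standard deviation is not stated in the excerpts; the choice here is a common default and only affects the reported spread, not the mean.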