Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning

Authors: Yixiu Mao, Yun Qu, Qi (Cheems) Wang, Xiangyang Ji

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks and exhibits strong robustness in scenarios with noisy or limited data. The code is available at https://github.com/thu-rllab/ANQ. ... We conduct experiments to evaluate the performance and properties of the proposed approach ANQ. Experimental details and extended results are provided in Appendices C and D, respectively.
Researcher Affiliation | Academia | Yixiu Mao¹, Yun Qu¹, Qi Wang¹, Xiangyang Ji¹. ¹Department of Automation, Tsinghua University. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 ANQ. 1: Initialize policy π_ϕ, auxiliary policy µ_ω, target auxiliary policy µ_ω′, Q-network Q_θ, target Q-network Q_θ′, and V-network V_ψ. 2: for each gradient step do 3: Update ψ by minimizing Eq. (14) 4: Update θ by minimizing Eq. (15) 5: Update ω by maximizing Eq. (13) 6: Update ϕ by maximizing Eq. (17) 7: Update target networks: θ′ ← (1 − ξ)θ′ + ξθ, ω′ ← (1 − ξ)ω′ + ξω 8: end for
Open Source Code | Yes | Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks and exhibits strong robustness in scenarios with noisy or limited data. The code is available at https://github.com/thu-rllab/ANQ.
Open Datasets | Yes | Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks [20], including Gym locomotion tasks and challenging AntMaze tasks. ... [20] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
Dataset Splits | Yes | We assess ANQ on two distinct task suites from D4RL [20]: the Gym-MuJoCo locomotion domains and the challenging AntMaze domains. ... Following the D4RL benchmark guidelines [20], we subtract 1 from the rewards in the AntMaze datasets. ... For the Gym locomotion tasks, performance is assessed by averaging returns over 10 evaluation trajectories and 5 random seeds. In the AntMaze suite, returns are averaged over 100 evaluation trajectories across the same number of seeds.
Hardware Specification | Yes | We test the runtime of ANQ and some baseline methods on a GeForce RTX 3090. ... Runtime of algorithms on halfcheetah-medium-replay-v2 on a GeForce RTX 3090.
Software Dependencies | No | The paper mentions 'Optimizer Adam [37]' but does not provide a specific version number for Adam or any other software libraries or frameworks. It also states 'Our implementation builds upon TD3 [22]', which refers to an algorithm, not a software package with a version.
Experiment Setup | Yes | A comprehensive list of hyperparameter settings for ANQ is provided in Table 3. Hyperparameter: Value. Optimizer: Adam [37]; Critic learning rate: 3e-4; Actor learning rate: 3e-4 with cosine schedule; Discount factor: 0.99 for Gym, 0.995 for Antmaze; Target update rate: 0.005; Policy update frequency: 2; Number of critics: 4; Batch size: 256; Number of iterations: 10^6; Lagrange multiplier λ: {0.1, 5.0}; Inverse temperature α: 1. IQL-specific: Expectile τ: 0.7 for Gym, 0.9 for Antmaze; Inverse temperature β: 3.0 for Gym, 10.0 for Antmaze. Architecture: Actor: input-256-256-output; Critic: input-256-256-1.
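
To make the quoted setup concrete, here is a minimal Python sketch of how the Table 3 values and the target-network update in step 7 of Algorithm 1 might be expressed. It is not taken from the authors' repository: the class name `ANQConfig`, its field names, and the helpers `soft_update` and `cosine_lr` are hypothetical, parameters are plain lists of floats rather than network weights, and the cosine schedule is assumed to decay to zero, which the quoted text does not specify.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ANQConfig:
    """Values copied from the quoted Table 3; field names are hypothetical."""
    critic_lr: float = 3e-4
    actor_lr: float = 3e-4             # used with a cosine schedule
    discount: float = 0.99             # Gym value; Table 3 lists 0.995 for Antmaze
    target_update_rate: float = 0.005  # xi in step 7 of Algorithm 1
    policy_update_frequency: int = 2
    num_critics: int = 4
    batch_size: int = 256
    num_iterations: int = 10**6


def soft_update(target, online, xi):
    """Soft target update: target <- (1 - xi) * target + xi * online."""
    return [(1.0 - xi) * t + xi * o for t, o in zip(target, online)]


def cosine_lr(base_lr, step, total_steps):
    """Cosine decay from base_lr to 0 (the decay floor is an assumption)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


cfg = ANQConfig()
target_params = [0.0, 1.0]   # toy stand-in for target-network weights
online_params = [1.0, 1.0]   # toy stand-in for online-network weights
target_params = soft_update(target_params, online_params, cfg.target_update_rate)
```

With a target update rate of 0.005, each call moves the target parameters only 0.5% of the way toward the online parameters, so the targets change slowly relative to the networks being trained.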