Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Scalable Policy-Based RL Algorithms for POMDPs

Authors: Ameya Anjarlekar, S. Rasoul Etesami, R. Srikant

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The paper includes a dedicated section C. Experimental Results which details the evaluation of the algorithm on a partially observable variant of the Frozen Lake-v1 environment, and presents results with plots showing average reward per episode for varying history lengths and observation noise levels, and a comparison of moving average reward. This indicates empirical studies with data analysis.
Researcher Affiliation | Academia | Ameya Anjarlekar (UIUC, EMAIL); S. Rasoul Etesami (UIUC, EMAIL); R. Srikant (UIUC, EMAIL). All authors are affiliated with UIUC, which stands for the University of Illinois Urbana-Champaign, an academic institution. The email domains also end in .edu.
Pseudocode | Yes | Algorithm 1: An Approximate TD Learning Algorithm for Superstate MDP. Algorithm 2: A Policy Optimization Based Algorithm to learn the Superstate MDP. Algorithm 3: A Greedy Algorithm to construct α.
Open Source Code | Yes | All the implementation code is available at this code repository: https://github.com/ameyanjarlekar/Policy-Based-RL-For-POMDPs.
Open Datasets | Yes | We evaluate the performance of our algorithm on a partially observable variant of the Frozen Lake-v1 environment. The Frozen Lake-v1 environment is a well-known, publicly available environment in the OpenAI Gym.
Dataset Splits | No | The paper mentions 'The agent trains over 200 episodes, each of fixed length 20 steps.' This describes the training duration but does not specify traditional dataset splits (e.g., train/test/validation percentages or counts) as data is generated through interaction with the environment rather than being a static dataset.
Hardware Specification | No | All the experiments were performed on the Google Colab CPU. This specifies the type of computing environment (CPU via Google Colab) but lacks specific hardware details such as CPU model, memory, or number of cores.
Software Dependencies | No | The paper mentions the use of 'POLITEX with TD(0)-based Q-value estimation and exponentiated gradient policy updates' and refers to 'OpenAI Gym' indirectly through 'Frozen Lake-v1 environment', but does not provide specific version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup | Yes | The agent trains over 200 episodes, each of fixed length 20 steps. The learning rate α is set to 0.1 and the discount factor γ to 0.9. Policies are represented tabularly and updated greedily with respect to Q-values after each episode.
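To make the quoted Experiment Setup values concrete, the following is a minimal sketch of a tabular loop that uses them: 200 episodes, at most 20 steps each, α = 0.1, γ = 0.9, a TD(0)-style Q-value update, and a greedy policy update after every episode. It is not the authors' implementation and does not reproduce POLITEX or Algorithms 1 to 3. The toy 4x4 grid, the deterministic moves, the noise model in `observe`, the history-window key, the `epsilon` exploration rate, and the early stop on terminal tiles are all assumptions introduced only for illustration; the linked repository is the authoritative source.

```python
import random
from collections import defaultdict, deque

# Hyperparameters quoted in the Experiment Setup row.
NUM_EPISODES = 200
EPISODE_LENGTH = 20
ALPHA = 0.1  # learning rate
GAMMA = 0.9  # discount factor

# Hypothetical stand-in environment (not the Gym implementation):
# S = start, F = frozen, H = hole, G = goal; moves are deterministic.
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
SIZE = 4
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up


def step(state, action):
    """Apply one move; return (next_state, reward, done)."""
    row, col = divmod(state, SIZE)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    tile = GRID[row][col]
    return row * SIZE + col, float(tile == "G"), tile in "HG"


def observe(state, noise, rng):
    """Assumed noise model: a uniformly random cell with probability `noise`."""
    return rng.randrange(SIZE * SIZE) if rng.random() < noise else state


def train(history_len=2, noise=0.1, epsilon=0.1, seed=0):
    """Tabular TD(0)-style learning keyed by the last `history_len` observations."""
    rng = random.Random(seed)
    q = defaultdict(lambda: [0.0] * len(MOVES))
    policy = {}   # greedy action per observation history
    returns = []  # total reward per episode

    def act(key):
        # epsilon-greedy exploration is an assumption, not taken from the paper.
        if key not in policy or rng.random() < epsilon:
            return rng.randrange(len(MOVES))
        return policy[key]

    for _ in range(NUM_EPISODES):
        state, total = 0, 0.0
        history = deque([observe(state, noise, rng)], maxlen=history_len)
        key = tuple(history)
        action = act(key)
        for _ in range(EPISODE_LENGTH):
            state, reward, done = step(state, action)
            total += reward
            history.append(observe(state, noise, rng))
            next_key = tuple(history)
            next_action = act(next_key)
            # TD(0) target bootstraps from the next (history, action) pair.
            bootstrap = 0.0 if done else GAMMA * q[next_key][next_action]
            q[key][action] += ALPHA * (reward + bootstrap - q[key][action])
            if done:  # simplification: stop early on a hole or the goal
                break
            key, action = next_key, next_action
        # Greedy policy update with respect to Q-values after each episode.
        for seen_key, values in q.items():
            policy[seen_key] = max(range(len(MOVES)), key=values.__getitem__)
        returns.append(total)
    return q, policy, returns
```

Calling `train(history_len=1)` and `train(history_len=3)` with different `noise` values gives a rough feel for the kind of sweep over history length and observation noise that the Research Type row describes, although the numbers from this toy setup are not comparable to the paper's plots.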