Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
A Reductions Approach to Risk-Sensitive Reinforcement Learning with Optimized Certainty Equivalents
Authors: Kaiwen Wang, Dawen Liang, Nathan Kallus, Wen Sun
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5. Simulation Experiments We describe a numerical simulation to demonstrate the importance of learning history-dependent policies for OCE RL and to empirically evaluate our algorithms. Our code can be found at https://github.com/kaiwenw/oce-rl. Setting up synthetic MDP. The proof-of-concept MDP is shown in Figure 1 and has two states. ... Experiment with tabular policies. ... Experiment with neural network policies. ... We plot the learning curves in Figure 2... |
| Researcher Affiliation | Collaboration | ¹Cornell Tech, ²Netflix Research. Correspondence to: Kaiwen Wang <kaiwenw.github.io>. Work done as Netflix intern. |
| Pseudocode | Yes | Algorithm 1 Meta-algorithm for optimistic oracles 1: Input: number of rounds K, optimistic oracle OPTALG satisfying Def. 3.1. 2: for round k = 1, 2, ..., K do 3: Query OPTALG in Aug MDP for value func. $\widehat{V}_{1,k}(\cdot)$. |
| Open Source Code | Yes | Our code can be found at https://github.com/kaiwenw/oce-rl. |
| Open Datasets | No | Setting up synthetic MDP. The proof-of-concept MDP is shown in Figure 1 and has two states. At $s_1$, all actions lead to a random reward $r_1 \sim \mathrm{Ber}(0.5)$ and transits to $s_2$. At $s_2$, the first action $a_1$ gives a random reward $r_2 \mid s_2, a_1 \sim 1.5 \cdot \mathrm{Ber}(0.75)$, while another action $a_2$ gives a deterministic reward $r_2 \mid s_2, a_2 = 0.5$. The trajectory ends after $s_2$. (The paper defines the MDP for simulation but does not provide access information for a publicly available dataset.) |
| Dataset Splits | No | The paper uses a synthetic MDP for its simulation experiments, which is an environment defined for a proof-of-concept rather than a dataset with traditional train/test/validation splits. The text mentions 'We repeat runs five times' which refers to experiment repetitions, not data partitioning for model training. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used to run the experiments, such as CPU or GPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions deep RL oracles like PPO and REINFORCE, and the Adam optimizer, but does not specify their version numbers or any other software dependencies with their exact versions. |
| Experiment Setup | Yes | C. More Details on Experimental Setup... Table 5. Hyperparameter settings used in our experiments (Component: Value/Description). Policy Network: Softmax policy with MLP with two hidden layers of dimension 64; Value Network: MLP with two hidden layers of dimension 64; Optimizer: Adam with β1 = 0.9, β2 = 0.999; Batch Size: 256; Learning Rate: 5e-3; PPO KL weight: 0.1; Regularization Weight: 0.1 |
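
The synthetic MDP quoted in the Open Datasets row is small enough to enumerate exactly, which may help when sanity-checking the released code. The following is a minimal sketch, not the authors' implementation: it assumes CVaR as the example OCE (using the standard variational form, the maximum over b of b - E[(b - X)_+] / τ), and the risk level `tau = 0.25` and all function names are illustrative choices that do not come from the paper.

```python
from itertools import product

# Rewards as (value, probability) pairs, taken from the MDP description above.
R1 = [(0.0, 0.5), (1.0, 0.5)]             # r1 ~ Ber(0.5) at s1
R2 = {
    "a1": [(0.0, 0.25), (1.5, 0.75)],     # r2 ~ 1.5 * Ber(0.75) under a1
    "a2": [(0.5, 1.0)],                   # r2 = 0.5 deterministically under a2
}


def return_distribution(policy):
    """Exact distribution of r1 + r2; `policy` maps the observed r1 to an action at s2."""
    dist = {}
    for r1, p1 in R1:
        for r2, p2 in R2[policy[r1]]:
            dist[r1 + r2] = dist.get(r1 + r2, 0.0) + p1 * p2
    return dist


def cvar(dist, tau):
    """Lower-tail CVaR of a discrete return distribution at level tau."""
    # The objective b - E[(b - X)_+] / tau is concave and piecewise linear in b,
    # so it is enough to search over the support points.
    return max(
        b - sum(p * max(b - x, 0.0) for x, p in dist.items()) / tau
        for b in dist
    )


def evaluate_policies(tau=0.25):
    """CVaR of every deterministic policy, keyed by (action if r1 = 0, action if r1 = 1)."""
    results = {}
    for act0, act1 in product(["a1", "a2"], repeat=2):
        policy = {0.0: act0, 1.0: act1}
        results[(act0, act1)] = cvar(return_distribution(policy), tau)
    return results


if __name__ == "__main__":
    for (act0, act1), value in evaluate_policies().items():
        print(f"r1=0 -> {act0}, r1=1 -> {act1}: CVaR = {value:.3f}")
```

Under these assumptions, the policy that plays a1 after r1 = 0 and a2 after r1 = 1 attains a higher CVaR than either policy that ignores r1, which is consistent with the paper's stated motivation for history-dependent policies. Other risk levels or other OCEs may rank the policies differently.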