Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Learning the Optimal Policy for Balancing Short-Term and Long-Term Rewards
Authors: Qinwei Yang, Xueqing Liu, Yan Zeng, Ruocheng Guo, Yang Liu, Peng Wu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted on the proposed method and the results validate the effectiveness of the method. |
| Researcher Affiliation | Collaboration | Qinwei Yang¹, Xueqing Liu¹, Yan Zeng¹, Ruocheng Guo², Yang Liu³, Peng Wu¹; ¹Beijing Technology and Business University, ²ByteDance Research, ³UC Santa Cruz |
| Pseudocode | Yes | Appendix B Algorithm Flowchart for DPPL Algorithm 1 DPPL Algorithm |
| Open Source Code | Yes | In addition, we provide the datasets and codes in supplemental material to ensure easy reproduction of all reported results. |
| Open Datasets | Yes | Following the previous studies [8], we use two widely used datasets: IHDP and JOBS, for evaluating the performance of the proposed method. |
| Dataset Splits | No | The paper does not explicitly mention training, validation, and test splits for the datasets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments. |
| Software Dependencies | No | The paper does not mention specific software dependencies or their version numbers. |
| Experiment Setup | Yes | Simulating Outcome. Consider the case of one long-term reward and one short-term reward. Following the previous data-generation mechanisms [1, 42], for the n-th unit (n = 1, ..., N), we simulate the potential short-term outcomes S(0) and S(1) as follows: $S_n(0) \sim \mathrm{Bern}(\sigma(w_0 X_n + \epsilon_{0,n}))$, $S_n(1) \sim \mathrm{Bern}(\sigma(w_1 X_n + \epsilon_{1,n}))$, where $\sigma(\cdot)$ is the sigmoid function, $w_0 \sim \mathcal{N}_{[-1,1]}(0, 1)$ follows a truncated normal distribution, $w_1 \sim \mathrm{Unif}(-1, 1)$ follows a uniform distribution, $\epsilon_{0,n} \sim \mathcal{N}(\mu_0, \sigma_0)$ and $\epsilon_{1,n} \sim \mathcal{N}(\mu_1, \sigma_1)$. We set $\mu_0 = 1$, $\mu_1 = 3$ and $\sigma_0 = \sigma_1 = 1$ for the IHDP dataset, and we set $\mu_0 = 0$, $\mu_1 = 2$ and $\sigma_0 = \sigma_1 = 1$ for the JOBS dataset. ... Experimental Details. ... We choose MLP as the policy model π(θ), and we average over 50 independent trials of policy learning with the short-term and long-term reward in IHDP and JOBS. We fix the missing ratio r = 0.2 and the time step T = 4. |
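
The short-term outcome simulation quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the covariates are synthetic stand-ins rather than IHDP or JOBS features, `w0` and `w1` are treated as per-feature weight vectors so that `w0 Xn` is a dot product, the second argument of each normal distribution is read as a standard deviation, and the function names (`simulate_short_term`, `truncated_std_normal`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def truncated_std_normal(rng, low=-1.0, high=1.0):
    """Draw from N(0, 1) restricted to [low, high] by rejection sampling."""
    while True:
        draw = rng.gauss(0.0, 1.0)
        if low <= draw <= high:
            return draw


def simulate_short_term(X, mu0, mu1, sigma0=1.0, sigma1=1.0, seed=0):
    """Simulate binary potential outcomes S(0) and S(1) for each row of X.

    Assumption: w0 and w1 are vectors with one weight per covariate.
    """
    rng = random.Random(seed)
    d = len(X[0])
    w0 = [truncated_std_normal(rng) for _ in range(d)]  # truncated normal on [-1, 1]
    w1 = [rng.uniform(-1.0, 1.0) for _ in range(d)]     # Unif(-1, 1)
    s0, s1 = [], []
    for x in X:
        eps0 = rng.gauss(mu0, sigma0)  # noise term for the control outcome
        eps1 = rng.gauss(mu1, sigma1)  # noise term for the treated outcome
        p0 = sigmoid(sum(w * v for w, v in zip(w0, x)) + eps0)
        p1 = sigmoid(sum(w * v for w, v in zip(w1, x)) + eps1)
        s0.append(int(rng.random() < p0))  # Bernoulli(p0) draw
        s1.append(int(rng.random() < p1))  # Bernoulli(p1) draw
    return s0, s1, w0, w1


# Synthetic covariates (hypothetical: 200 units, 5 features), IHDP-style means.
data_rng = random.Random(1)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
s0, s1, w0, w1 = simulate_short_term(X, mu0=1.0, mu1=3.0)
```

Passing `mu0=0.0, mu1=2.0` instead corresponds to the JOBS setting quoted above. Rejection sampling is used for the truncated normal only to keep the sketch dependency-free; a library routine for truncated normals would serve equally well.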