Learning the Optimal Policy for Balancing Short-Term and Long-Term Rewards
Authors: Qinwei Yang, Xueqing Liu, Yan Zeng, Ruocheng Guo, Yang Liu, Peng Wu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted on the proposed method and the results validate the effectiveness of the method. |
| Researcher Affiliation | Collaboration | Qinwei Yang¹, Xueqing Liu¹, Yan Zeng¹, Ruocheng Guo², Yang Liu³, Peng Wu¹ — ¹Beijing Technology and Business University, ²ByteDance Research, ³UC Santa Cruz |
| Pseudocode | Yes | Appendix B Algorithm Flowchart for DPPL Algorithm 1 DPPL Algorithm |
| Open Source Code | Yes | In addition, we provide the datasets and codes in supplemental material to ensure easy reproduction of all reported results. |
| Open Datasets | Yes | Following the previous studies [8], we use two widely used datasets: IHDP and JOBS, for evaluating the performance of the proposed method. |
| Dataset Splits | No | The paper does not explicitly mention training, validation, and test splits for the datasets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments. |
| Software Dependencies | No | The paper does not mention specific software dependencies or their version numbers. |
| Experiment Setup | Yes | Simulating Outcome. Consider the case of one long-term reward and one short-term reward. Following the previous data-generation mechanisms [1, 42], for the n-th unit (n = 1, ..., N), we simulate the potential short-term outcomes S(0) and S(1) as follows: $S_n(0) \sim \mathrm{Bern}(\sigma(w_0 X_n + \epsilon_{0,n}))$, $S_n(1) \sim \mathrm{Bern}(\sigma(w_1 X_n + \epsilon_{1,n}))$, where $\sigma(\cdot)$ is the sigmoid function, $w_0 \sim N_{[-1,1]}(0, 1)$ follows a truncated normal distribution, $w_1 \sim \mathrm{Unif}(-1, 1)$ follows a uniform distribution, $\epsilon_{0,n} \sim N(\mu_0, \sigma_0)$ and $\epsilon_{1,n} \sim N(\mu_1, \sigma_1)$. We set $\mu_0 = 1$, $\mu_1 = 3$ and $\sigma_0 = \sigma_1 = 1$ for the IHDP dataset, and we set $\mu_0 = 0$, $\mu_1 = 2$ and $\sigma_0 = \sigma_1 = 1$ for the JOBS dataset. ... Experimental Details. ... We choose MLP as the policy model π(θ), and we average over 50 independent trials of policy learning with the short-term and long-term reward in IHDP and JOBS. We fix the missing ratio r = 0.2 and the time step T = 4. (A code sketch of the simulation step follows the table.) |
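
As a rough illustration of the outcome-simulation step quoted in the Experiment Setup row, the sketch below draws the potential short-term outcomes S(0) and S(1) from the stated distributions using NumPy/SciPy. The function name `simulate_short_term_outcomes`, the covariate matrix `X` (taken from IHDP or JOBS), and the reading of $w_0 X_n$ as a dot product over covariates are assumptions, not details confirmed by the paper; the authors' released code in the supplemental material is the reference implementation.

```python
import numpy as np
from scipy.stats import truncnorm


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def simulate_short_term_outcomes(X, mu0, mu1, sigma0=1.0, sigma1=1.0, seed=0):
    """Simulate potential short-term outcomes S(0), S(1) for N units with
    covariates X of shape (N, d), following the mechanism quoted above.
    NOTE: a hypothetical sketch; weight shapes and the dot-product reading
    of w0*Xn are assumptions."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # w0 ~ standard normal truncated to [-1, 1]
    w0 = truncnorm.rvs(-1, 1, size=d, random_state=seed)
    # w1 ~ Unif(-1, 1)
    w1 = rng.uniform(-1, 1, size=d)
    # unit-level noise terms eps_{0,n} ~ N(mu0, sigma0), eps_{1,n} ~ N(mu1, sigma1)
    eps0 = rng.normal(mu0, sigma0, size=N)
    eps1 = rng.normal(mu1, sigma1, size=N)
    # Bernoulli outcomes with sigmoid-transformed linear scores
    p0 = sigmoid(X @ w0 + eps0)
    p1 = sigmoid(X @ w1 + eps1)
    S0 = rng.binomial(1, p0)
    S1 = rng.binomial(1, p1)
    return S0, S1
```

Under these assumptions, the IHDP setting quoted above would correspond to `simulate_short_term_outcomes(X, mu0=1, mu1=3)` and the JOBS setting to `simulate_short_term_outcomes(X, mu0=0, mu1=2)`.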