Policy Learning for Balancing Short-Term and Long-Term Rewards
Authors: Peng Wu, Ziyu Shen, Feng Xie, Zhongyao Wang, Chunchen Liu, Yan Zeng
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted to validate the effectiveness of the proposed method, demonstrating its practical applicability. |
| Researcher Affiliation | Collaboration | 1. School of Mathematics and Statistics, Beijing Technology and Business University; 2. LingYang, Alibaba Group, Hangzhou, China. |
| Pseudocode | No | The paper describes its method in prose, outlining steps for policy evaluation and learning, and provides mathematical formulations. However, it does not include a formalized pseudocode block or an algorithm figure. |
| Open Source Code | Yes | Please note that our code is available at https://github.com/YanaZeng/Short_long_term-Rewards. |
| Open Datasets | Yes | We perform extensive experiments on three widely used benchmark datasets, IHDP (Hill, 2011), JOBS (LaLonde, 1986), and PRODUCT (Gao et al., 2022). |
| Dataset Splits | No | The paper mentions running '50 independent trials' and describes how data is simulated, but it does not specify explicit training/validation/test dataset splits (e.g., percentages or sample counts for each partition). |
| Hardware Specification | No | The paper describes its experimental setup but does not provide any specific details about the hardware (e.g., GPU model, CPU type, memory) used to conduct the experiments. |
| Software Dependencies | No | The paper mentions the use of 'machine learning methods' and the 'Adam optimization method' but does not specify any software names with version numbers for libraries, frameworks, or programming languages (e.g., Python version, PyTorch/TensorFlow version). |
| Experiment Setup | Yes | We simulate the potential short-term outcomes as follows: $S_i(0) \sim \mathrm{Bern}(\sigma(w_0 X_i + \epsilon_{0,i}))$, $S_i(1) \sim \mathrm{Bern}(\sigma(w_1 X_i + \epsilon_{1,i}))$... We set $\mu_0 = 1$, $\mu_1 = 3$ and $\sigma_0 = \sigma_1 = 1$ for the IHDP dataset... $Y_{t,i}(0) \sim N(\beta_0 X_i, 1) + C \sum_{j=0}^{t-1} Y_{j,i}(0)$... where $C = 0.02$ is a scaling factor... we set the initial value at time step 0 as $Y_{0,i}(0) = S_i(0)$, $Y_{0,i}(1) = S_i(1)$, then generate $Y_{t,i}(0), Y_{t,i}(1)$ following Eq. (7), and we eventually regard the outcome at the last time step $T$ as the long-term reward, $Y_i(0) = Y_{T,i}(0)$, $Y_i(1) = Y_{T,i}(1)$. ... we fix the missing ratio of outcomes $Y$ to be 0.1 and the number of time steps is $T = 10$. For ease of comparison, we transform the optimization problem into $\arg\max_{\pi \in \Pi} (1-\lambda)\hat{V}(\pi; s) + \lambda \hat{V}(\pi; y)$, where $\lambda$ is a balance factor between short- and long-term rewards. ... we compare three different optimization strategies: NAIVE-S ($\lambda = 0$), NAIVE-Y ($\lambda = 1$), and Ours (Balanced, $\lambda = 0.5$). |
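For concreteness, the simulated data-generating process and the balanced objective quoted in the Experiment Setup row can be sketched in a few lines of NumPy. This is an illustrative reconstruction rather than the authors' released code: the covariate dimension, coefficient draws, noise scales, the candidate policy `pi_example`, and the oracle `policy_value` helper (which reads off the simulated potential rewards instead of estimating $\hat{V}(\pi; s)$ and $\hat{V}(\pi; y)$ from observed data, as the paper does) are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulate covariates and potential short-term outcomes S_i(a).
n, d = 1000, 5                              # sample size and covariate dimension (illustrative)
X = rng.normal(size=(n, d))
w0, w1 = rng.normal(size=d), rng.normal(size=d)
S0 = rng.binomial(1, sigmoid(X @ w0 + rng.normal(size=n)))  # S_i(0) ~ Bern(sigma(w0 X_i + eps_0))
S1 = rng.binomial(1, sigmoid(X @ w1 + rng.normal(size=n)))  # S_i(1) ~ Bern(sigma(w1 X_i + eps_1))

# Autoregressive long-term outcomes; the value at the last step T is the long-term reward.
T, C = 10, 0.02                             # time steps and scaling factor from the setup above
beta0, beta1 = rng.normal(size=d), rng.normal(size=d)
Y0_path, Y1_path = [S0.astype(float)], [S1.astype(float)]   # initial values Y_{0,i}(a) = S_i(a)
for t in range(1, T + 1):
    Y0_path.append(rng.normal(X @ beta0, 1.0) + C * np.sum(Y0_path, axis=0))
    Y1_path.append(rng.normal(X @ beta1, 1.0) + C * np.sum(Y1_path, axis=0))
Y0, Y1 = Y0_path[-1], Y1_path[-1]           # long-term rewards Y_i(a) = Y_{T,i}(a)

# Balanced objective (1 - lambda) * V(pi; s) + lambda * V(pi; y).
def policy_value(pi, r0, r1):
    """Oracle policy value on the simulated potential rewards (the paper instead
    estimates V-hat from observed data with partially missing long-term outcomes)."""
    return np.mean(np.where(pi == 1, r1, r0))

def balanced_value(pi, lam):
    return (1.0 - lam) * policy_value(pi, S0, S1) + lam * policy_value(pi, Y0, Y1)

pi_example = (X @ w1 > 0).astype(int)       # arbitrary candidate policy for illustration
for lam in (0.0, 0.5, 1.0):                 # NAIVE-S, Balanced (Ours), NAIVE-Y
    print(f"lambda = {lam:.1f}: balanced value = {balanced_value(pi_example, lam):.3f}")
```

Sweeping $\lambda$ over 0, 0.5, and 1 mirrors the NAIVE-S, Balanced, and NAIVE-Y strategies compared in the experiments; the actual method replaces the oracle values with the paper's estimators and optimizes the policy over a class $\Pi$.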