Learning the Optimal Policy for Balancing Short-Term and Long-Term Rewards

Authors: Qinwei Yang, Xueqing Liu, Yan Zeng, Ruocheng Guo, Yang Liu, Peng Wu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted on the proposed method and the results validate the effectiveness of the method.
Researcher Affiliation | Collaboration | Qinwei Yang (1), Xueqing Liu (1), Yan Zeng (1), Ruocheng Guo (2), Yang Liu (3), Peng Wu (1); (1) Beijing Technology and Business University, (2) ByteDance Research, (3) UC Santa Cruz
Pseudocode | Yes | Appendix B: Algorithm Flowchart for DPPL; Algorithm 1: DPPL Algorithm
Open Source Code | Yes | In addition, we provide the datasets and codes in supplemental material to ensure easy reproduction of all reported results.
Open Datasets | Yes | Following the previous studies [8], we use two widely used datasets: IHDP and JOBS, for evaluating the performance of the proposed method.
Dataset Splits | No | The paper does not explicitly mention training, validation, and test splits for the datasets.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments.
Software Dependencies | No | The paper does not mention specific software dependencies or their version numbers.
Experiment Setup | Yes | Simulating Outcome. Consider the case of one long-term reward and one short-term reward. Following the previous data-generation mechanisms [1, 42], for the n-th unit (n = 1, ..., N), we simulate the potential short-term outcomes S(0) and S(1) as follows: S_n(0) ~ Bern(σ(w_0 X_n + ε_{0,n})), S_n(1) ~ Bern(σ(w_1 X_n + ε_{1,n})), where σ(·) is the sigmoid function, w_0 ~ N_[-1,1](0, 1) follows a truncated normal distribution, w_1 ~ Unif(-1, 1) follows a uniform distribution, ε_{0,n} ~ N(μ_0, σ_0) and ε_{1,n} ~ N(μ_1, σ_1). We set μ_0 = 1, μ_1 = 3 and σ_0 = σ_1 = 1 for the IHDP dataset, and we set μ_0 = 0, μ_1 = 2 and σ_0 = σ_1 = 1 for the JOBS dataset. ... Experimental Details. ... We choose MLP as the policy model π(θ), and we average over 50 independent trials of policy learning with the short-term and long-term reward in IHDP and JOBS. We fix the missing ratio r = 0.2 and the time step T = 4.
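For concreteness, the following is a minimal sketch of the outcome-simulation step quoted in the Experiment Setup row above. It assumes the covariates X form an N × d matrix, that w_0 X_n denotes a dot product with weight vectors drawn once for all units, and that a simple rejection sampler is an acceptable stand-in for the truncated normal; the function name, argument names, and seed handling are illustrative and not taken from the authors' released code.

```python
# Sketch of the short-term outcome simulation described in the paper's Experiment Setup.
# Assumptions (not from the paper): X has shape (N, d), weights are shared across units,
# and the truncated normal is drawn via rejection sampling.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def truncated_normal(rng, low, high, size):
    # Rejection sampler for N(0, 1) truncated to [low, high].
    draws = rng.normal(size=size)
    bad = (draws < low) | (draws > high)
    while bad.any():
        draws[bad] = rng.normal(size=bad.sum())
        bad = (draws < low) | (draws > high)
    return draws

def simulate_short_term_outcomes(X, mu0, mu1, sigma0=1.0, sigma1=1.0, seed=0):
    """Simulate potential short-term outcomes S(0), S(1) for N units with covariates X."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w0 = truncated_normal(rng, -1.0, 1.0, size=d)   # w_0 ~ N_[-1,1](0, 1)
    w1 = rng.uniform(-1.0, 1.0, size=d)             # w_1 ~ Unif(-1, 1)
    eps0 = rng.normal(mu0, sigma0, size=n)          # ε_{0,n} ~ N(μ_0, σ_0)
    eps1 = rng.normal(mu1, sigma1, size=n)          # ε_{1,n} ~ N(μ_1, σ_1)
    s0 = rng.binomial(1, sigmoid(X @ w0 + eps0))    # S_n(0) ~ Bern(σ(w_0 X_n + ε_{0,n}))
    s1 = rng.binomial(1, sigmoid(X @ w1 + eps1))    # S_n(1) ~ Bern(σ(w_1 X_n + ε_{1,n}))
    return s0, s1

# Parameter settings quoted above: IHDP uses mu0 = 1, mu1 = 3; JOBS uses mu0 = 0, mu1 = 2
# (sigma0 = sigma1 = 1 in both). Example call with synthetic covariates:
# X = np.random.default_rng(1).normal(size=(100, 25))
# s0, s1 = simulate_short_term_outcomes(X, mu0=1, mu1=3)
```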