CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay

Authors: Natasha Butt, Blazej Manczak, Auke Wiggers, Corrado Rainone, David W. Zhang, Michaël Defferrard, Taco Cohen

ICML 2024

Each entry below gives a reproducibility variable, the assessed result, and the supporting LLM response.
Variable: Research Type
Result: Experimental
LLM Response: Applying CodeIt to the ARC dataset, we demonstrate that prioritized hindsight replay, along with pre-training and data augmentation, leads to successful inter-task generalization. CodeIt is the first neuro-symbolic approach that scales to the full ARC evaluation dataset. Our method solves 15% of ARC evaluation tasks, achieving state-of-the-art performance and outperforming existing neural and symbolic baselines. Our code is available at https://github.com/Qualcomm-AI-research/codeit. In this section, we aim to demonstrate the efficacy of CodeIt and break down how much the different components of the method contribute to performance. We first tuned hyperparameters on a custom training and validation split (for a description of these parameters and details, see Appendix B). Using these hyperparameters, we benchmark our method on the ARC evaluation split and compare against previous state-of-the-art methods. Finally, we ablate the importance of individual components of CodeIt.
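The "prioritized hindsight replay" of the title rests on relabeling: a sampled program that fails the target task is still, by construction, a correct solution for the outputs it actually produced. A minimal Python sketch of that relabeling step, assuming a hypothetical DSL interpreter run_program and Task container (neither name is from the CodeIt repository):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    inputs: list    # demonstration input grids
    outputs: list   # demonstration output grids

def run_program(program: str, grid) -> Optional[list]:
    raise NotImplementedError  # stand-in for the DSL interpreter of Hodel (2023)

def hindsight_relabel(program: str, task: Task) -> Optional[Task]:
    # Execute the program on the task's inputs; whatever grids it produced,
    # it is a correct program for *those* outputs, so keep the pair.
    actual = [run_program(program, grid) for grid in task.inputs]
    if any(out is None for out in actual):
        return None  # the program errored on some input; discard it
    return Task(inputs=task.inputs, outputs=actual)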
Variable: Researcher Affiliation
Result: Collaboration
LLM Response: 1 University of Amsterdam; 2 Qualcomm AI Research (Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.); 3 work was completed while an employee at Qualcomm Technologies Netherlands B.V.
Variable: Pseudocode
Result: Yes
LLM Response: For pseudocode, see Appendix A.1. The pseudocode for the CodeIt procedure is given in Algorithm 1.
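Algorithm 1 itself sits in the paper's appendix; as a hedged sketch, the two-stage meta-iteration it describes might look as follows in Python. Every name here (policy.sample, buffer.sample_prioritized, and so on) is an illustrative stand-in rather than the repository's API, and hindsight_relabel is the sketch above:

import random

def codeit_loop(policy, tasks, buffer, augmented_train, n_meta_iterations,
                n_rho=24, tau=0.95, r_t=10_000, r_p=90_000, n_epochs=1):
    # Default values mirror the hyperparameters reported in Table 3 below.
    for _ in range(n_meta_iterations):
        # Stage 1: sampling and hindsight relabeling.
        for task in tasks:
            for program in policy.sample(task, n=n_rho, temperature=tau):
                relabeled = hindsight_relabel(program, task)
                if relabeled is not None:
                    buffer.add(program, relabeled)
        # Stage 2: learning on prioritized buffer samples mixed with
        # mutated/augmented training examples.
        batch = (buffer.sample_prioritized(r_p)
                 + random.sample(augmented_train, r_t))
        policy.fine_tune(batch, epochs=n_epochs)
    return policy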
Variable: Open Source Code
Result: Yes
LLM Response: Our code is available at https://github.com/Qualcomm-AI-research/codeit.
Variable: Open Datasets
Result: Yes
LLM Response: We initialize our training set with the 400 examples from the ARC training split and the associated solution programs provided by Hodel (2023).
Variable: Dataset Splits
Result: Yes
LLM Response: We first tuned hyperparameters on a custom training and validation split (for a description of these parameters and details, see Appendix B). We choose the split such that Dtrain and Dvalid contain roughly equally difficult programs by sampling based on program length: Dtrain contains 80% of 2-line programs, 80% of 3-line programs, and so on. This results in 311 examples in Dtrain and 89 examples in Dvalid.
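The length-stratified 80/20 split could be implemented along the following lines; a sketch assuming each task id maps to the line count of its Hodel (2023) solution program. Exact counts such as 311/89 depend on rounding within each length bucket:

import random
from collections import defaultdict

def split_by_program_length(task_ids, program_lengths, train_frac=0.8, seed=0):
    # program_lengths: task id -> number of lines in its solution program.
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for task_id in task_ids:
        buckets[program_lengths[task_id]].append(task_id)
    d_train, d_valid = [], []
    for _, bucket in sorted(buckets.items()):
        rng.shuffle(bucket)
        cut = round(train_frac * len(bucket))  # ~80% of each length bucket
        d_train.extend(bucket[:cut])
        d_valid.extend(bucket[cut:])
    return d_train, d_valid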
Variable: Hardware Specification
Result: Yes
LLM Response: Experiments were run for a maximum of 120 hours on an NVIDIA A100 80GB.
Variable: Software Dependencies
Result: No
LLM Response: The paper mentions using CodeT5+ as its policy model and refers to a specific DSL implementation from Hodel (2023), but it does not provide version numbers for software dependencies such as Python, PyTorch, or other libraries used in the experimental setup.
Variable: Experiment Setup
Result: Yes
LLM Response: Table 3. Table of hyperparameters.

Sampling and hindsight relabeling:
  nρ = 24 (no. policy samples ρ per task per meta-iteration)
  nm = 19,200 (no. mutated samples for augmented train set)
  τ = 0.95 (sampling temperature)
  rt = 10,000 (no. experiences sampled from augmented train set)
  rp = 90,000 (no. experiences sampled from buffer)
Learning:
  nϵ = 1 (no. train epochs per meta-iteration)
  lr = 5e-5 (learning rate)
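For quick reference, the same Table 3 values as a plain Python mapping; the key names are shorthand of our own, not identifiers from the released code:

CODEIT_HPARAMS = {
    # Sampling and hindsight relabeling
    "n_rho": 24,      # policy samples per task per meta-iteration
    "n_m": 19_200,    # mutated samples for the augmented train set
    "tau": 0.95,      # sampling temperature
    "r_t": 10_000,    # experiences sampled from augmented train set
    "r_p": 90_000,    # experiences sampled from buffer
    # Learning
    "n_epochs": 1,    # train epochs per meta-iteration
    "lr": 5e-5,       # learning rate
}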