CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay
Authors: Natasha Butt, Blazej Manczak, Auke Wiggers, Corrado Rainone, David W. Zhang, Michaël Defferrard, Taco Cohen
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applying CodeIt to the ARC dataset, we demonstrate that prioritized hindsight replay, along with pre-training and data augmentation, leads to successful inter-task generalization. CodeIt is the first neuro-symbolic approach that scales to the full ARC evaluation dataset. Our method solves 15% of ARC evaluation tasks, achieving state-of-the-art performance and outperforming existing neural and symbolic baselines. Our code is available at https://github.com/Qualcomm-AI-research/codeit. In this section, we aim to demonstrate the efficacy of CodeIt and break down how much the different components of the method contribute to performance. We first tuned hyperparameters on a custom training and validation split (for a description of these parameters and details, see Appendix B). Using these hyperparameters, we benchmark our method on the ARC evaluation split and compare against previous state-of-the-art methods. Finally, we ablate the importance of individual components of CodeIt. |
| Researcher Affiliation | Collaboration | 1 University of Amsterdam; 2 Qualcomm AI Research (Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.); 3 Work was completed while an employee at Qualcomm Technologies Netherlands B.V. |
| Pseudocode | Yes | For pseudocode, see Appendix A.1. The pseudocode for the CodeIt procedure is shown in Algorithm 1. (A hedged sketch of this loop appears after the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/Qualcomm-AI-research/codeit. |
| Open Datasets | Yes | We initialize our training set with the 400 examples from the ARC training split and the associated solution programs provided by Hodel (2023). |
| Dataset Splits | Yes | We first tuned hyperparameters on a custom training and validation split (for a description of these parameters and details, see Appendix B). We choose the split such that D_train and D_valid contain roughly equally difficult programs by sampling based on program length: D_train contains 80% of 2-line programs, 80% of 3-line programs, and so on. This results in 311 examples in D_train and 89 examples in D_valid. (See the split sketch after this table.) |
| Hardware Specification | Yes | Experiments were run for a maximum of 120 hours on an NVIDIA A100 80GB. |
| Software Dependencies | No | The paper mentions using 'CodeT5+' as its policy model and refers to a specific DSL implementation from Hodel (2023), but it does not provide version numbers for software dependencies such as Python, PyTorch, or other libraries used in the experimental setup. |
| Experiment Setup | Yes | Table 3 (hyperparameters, by CodeIt stage). Sampling and hindsight relabeling: n_ρ = 24 (no. policy samples ρ per task per meta-iteration); n_m = 19,200 (no. mutated samples for augmented train set); τ = 0.95 (sampling temperature); r_t = 10,000 (no. experiences sampled from augmented train set); r_p = 90,000 (no. experiences sampled from buffer). Learning: n_ε = 1 (no. train epochs per meta-iteration); lr = 5e-5 (learning rate). (A hypothetical config snippet mirroring these values follows the table.) |
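
The Pseudocode row above points to Algorithm 1, which alternates between sampling, hindsight relabeling, and learning. The following is a minimal Python sketch of one meta-iteration under stated assumptions: `policy`, `buffer`, and `train_set` are hypothetical interfaces standing in for the CodeT5+ model, the experience replay buffer, and the augmented train set; only the control flow and the Table 3 defaults come from the paper.

```python
def codeit_meta_iteration(policy, tasks, buffer, train_set,
                          n_rho=24, tau=0.95, r_t=10_000, r_p=90_000):
    """One CodeIt meta-iteration (sketch of Algorithm 1); all method
    names on `policy`, `buffer`, and `train_set` are assumed, not the
    identifiers used in the released code."""
    # --- Sampling and hindsight relabeling ---
    for task in tasks:
        for program in policy.sample(task, n=n_rho, temperature=tau):
            outputs = [policy.execute(program, grid) for grid in task.inputs]
            if any(out is None for out in outputs):
                continue  # skip programs that crash or time out
            # Hindsight relabeling: pair the program with the outputs it
            # actually produced, so every executable sample becomes a
            # correct training example for some (relabeled) task.
            buffer.add(inputs=task.inputs, outputs=outputs, program=program)
    # --- Learning with prioritized replay ---
    # Mix r_t experiences from the augmented train set with r_p experiences
    # drawn from the buffer (Table 3 values), then fine-tune for one epoch.
    batch = train_set.sample(r_t) + buffer.sample_prioritized(r_p)
    policy.finetune(batch, epochs=1)
```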
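
The Dataset Splits row describes a length-stratified 80/20 split. Below is a self-contained sketch of that procedure, assuming `programs` is a list of (task_id, source) pairs built from the Hodel (2023) solution programs; the variable names are illustrative.

```python
import random
from collections import defaultdict

def stratified_split(programs, train_frac=0.8, seed=0):
    """Split programs so D_train keeps ~80% of each line-count group,
    as described in the paper; D_valid gets the remainder."""
    rng = random.Random(seed)
    by_length = defaultdict(list)
    # Group programs by their number of lines (a proxy for difficulty).
    for task_id, source in programs:
        by_length[len(source.splitlines())].append((task_id, source))
    d_train, d_valid = [], []
    for _, group in sorted(by_length.items()):
        rng.shuffle(group)
        cut = round(train_frac * len(group))
        d_train.extend(group[:cut])
        d_valid.extend(group[cut:])
    return d_train, d_valid
```

Applied to the 400 ARC training programs, this yields approximately the reported 311/89 split, with the exact counts depending on per-group rounding.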
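
Finally, the Table 3 values in the Experiment Setup row fit naturally into a single configuration object. The snippet below is purely illustrative; the key names are hypothetical and not taken from the released codebase.

```python
# Hypothetical configuration mirroring Table 3 of the paper.
CODEIT_HPARAMS = {
    "sampling_and_hindsight_relabeling": {
        "n_policy_samples_per_task": 24,    # n_rho
        "n_mutated_samples": 19_200,        # n_m, augmented train set size
        "temperature": 0.95,                # tau
        "n_train_set_experiences": 10_000,  # r_t, from augmented train set
        "n_buffer_experiences": 90_000,     # r_p, from replay buffer
    },
    "learning": {
        "epochs_per_meta_iteration": 1,     # n_epsilon
        "learning_rate": 5e-5,              # lr
    },
}
```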