diff History for Neural Language Agents
Authors: Ulyana Piterbarg, Lerrel Pinto, Rob Fergus
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a series of experiments testing instruction finetuning with diff history in two different environments: the low-dimensional, multi-task BabyAI-Text environment (Carta et al., 2023) and the high-dimensional, as-of-yet unsolved video game NetHack (Küttler et al., 2020). |
| Researcher Affiliation | Academia | Dept. of Computer Science, Courant Institute, NYU. Correspondence to: Ulyana Piterbarg <up2021@cims.nyu.edu>. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | We open-source our code and data to https://diffhistory.github.io. |
| Open Datasets | Yes | We open-source our code and data to https://diffhistory.github.io. |
| Dataset Splits | Yes | Models were tuned with action prediction losses for a single epoch on an 80/20 split of each tuning dataset and were validated for perplexity on the held-out portion. |
| Hardware Specification | Yes | Finetuning is conducted in mixed-precision with Microsoft DeepSpeed (Wang et al., 2023a) and Hugging Face Accelerate (Gugger et al., 2022) on NVIDIA A100 GPU nodes. All neural LMs are evaluated on single NVIDIA RTX8000, A100, or A4000 GPUs on an academic high performance computing cluster. |
| Software Dependencies | No | The paper mentions software such as PyTorch, Hugging Face Transformers, Microsoft DeepSpeed, and Hugging Face Accelerate, but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | We employ a batch size of 250 and a 32-epoch linear learning rate schedule with a warm-up ratio of 0.03 in all training and finetuning experiments, with the exception of the ultra-low data BabyAI-Text experiment (1K demonstrations only). In this experiment, we tune models with the same batch size and learning rate schedule but for 64 epochs. We found γ = 3 × 10⁻⁴ to consistently be the best performing learning rate. We preserve the default 1024-token context length of the model in BabyAI-Text, but extend model context lengths by a factor of four up to 4096 tokens for finetuning on LangHack by introducing new positional encodings. |
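
As a rough illustration of the 80/20 split reported in the Dataset Splits row, the following is a minimal sketch using the Hugging Face `datasets` API. The data file name and the seed are placeholders introduced here, not taken from the authors' released code or data at https://diffhistory.github.io.

```python
from datasets import load_dataset

# Placeholder file name; the actual tuning data is released at https://diffhistory.github.io.
demos = load_dataset("json", data_files="tuning_demos.jsonl")["train"]

# 80% of each tuning dataset is used for action-prediction finetuning,
# and the held-out 20% is used for perplexity validation (seed chosen arbitrarily here).
split = demos.train_test_split(test_size=0.2, seed=0)
tune_set, val_set = split["train"], split["test"]
```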
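Similarly, the hyperparameters quoted in the Experiment Setup row can be gathered into a Hugging Face `TrainingArguments` sketch. This is an assumption about how the quoted configuration maps onto the Transformers API, not the authors' training script: the output directory is a placeholder, the batch size of 250 is treated as a single global figure before any DeepSpeed/Accelerate sharding, the epoch count is omitted because the Dataset Splits row reports a single tuning epoch while the schedule is described as spanning 32 epochs, and the extension of the context length from 1024 to 4096 tokens via new positional encodings is model-specific and not shown.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="diff-history-finetune",  # placeholder path
    per_device_train_batch_size=250,     # "batch size of 250" (global figure, before sharding)
    learning_rate=3e-4,                  # γ = 3 × 10⁻⁴, the best-performing rate reported
    lr_scheduler_type="linear",          # linear learning rate schedule
    warmup_ratio=0.03,                   # warm-up ratio of 0.03
    bf16=True,                           # mixed-precision finetuning (DeepSpeed/Accelerate handle distribution)
)
```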