diff History for Neural Language Agents

Authors: Ulyana Piterbarg, Lerrel Pinto, Rob Fergus

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility variables, results, and supporting LLM responses:
Research Type: Experimental. "We conduct a series of experiments testing instruction finetuning with diff history in two different environments: the low-dimensional, multi-task BabyAI-Text environment (Carta et al., 2023) and the high-dimensional, as-of-yet unsolved video game NetHack (Küttler et al., 2020)."
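The technique named in the title, diff history, refers to replacing repeated full-text observations in an agent's prompt with the textual deltas between consecutive observations. As a minimal sketch of that idea, using Python's standard difflib rather than the authors' released code, with an illustrative function name and example observations:

```python
import difflib

def diff_history(observations):
    """Keep the first observation in full, then represent every later
    observation as a unified diff against the one before it."""
    if not observations:
        return []
    history = [observations[0]]
    for prev, curr in zip(observations, observations[1:]):
        delta = difflib.unified_diff(
            prev.splitlines(), curr.splitlines(), lineterm=""
        )
        # Drop the "---"/"+++" file headers; keep only the hunk text.
        history.append("\n".join(
            line for line in delta if not line.startswith(("---", "+++"))
        ))
    return history

# Hypothetical pair of consecutive BabyAI-Text-style observations.
obs = [
    "You see a red ball 2 steps ahead.\nYou carry nothing.",
    "You see a red ball 1 step ahead.\nYou carry nothing.",
]
print(diff_history(obs)[1])  # prints the "@@ ... @@" hunk with -/+ lines
```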
Researcher Affiliation: Academia. "Dept. of Computer Science, Courant Institute, NYU. Correspondence to: Ulyana Piterbarg <up2021@cims.nyu.edu>."
Pseudocode: No. No structured pseudocode or algorithm blocks were found.
Open Source Code: Yes. "We open-source our code and data to https://diffhistory.github.io."
Open Datasets: Yes. "We open-source our code and data to https://diffhistory.github.io."
Dataset Splits: Yes. "Models were tuned with action prediction losses for a single epoch on an 80/20 split of each tuning dataset and were validated for perplexity on the held-out portion."
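The split is described only at this level of detail; a simple way to produce such an 80/20 partition, assuming each tuning dataset is an in-memory list of examples and that the shuffling seed is unspecified in the paper, would be:

```python
import random

def train_validation_split(examples, held_out_frac=0.2, seed=0):
    """Shuffle examples and hold out a fraction for perplexity validation."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_held_out = int(len(shuffled) * held_out_frac)
    return shuffled[n_held_out:], shuffled[:n_held_out]

train_set, validation_set = train_validation_split(range(1000))
print(len(train_set), len(validation_set))  # 800 200
```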
Hardware Specification: Yes. "Finetuning is conducted in mixed precision with Microsoft DeepSpeed (Wang et al., 2023a) and Hugging Face Accelerate (Gugger et al., 2022) on NVIDIA A100 GPU nodes. All neural LMs are evaluated on single NVIDIA RTX 8000, A100, or A4000 GPUs on an academic high-performance computing cluster."
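The launch configuration itself is not quoted here; a minimal mixed-precision training loop with Hugging Face Accelerate might look like the sketch below. The model, optimizer, and dataloader are placeholders, and DeepSpeed would typically be enabled separately (for example through the Accelerate launch configuration or a DeepSpeedPlugin) rather than in this code.

```python
from accelerate import Accelerator

def finetune(model, optimizer, dataloader, num_epochs=1):
    # bf16 mixed precision; DeepSpeed/ZeRO settings would normally be
    # supplied via the Accelerate launch configuration, not in code.
    accelerator = Accelerator(mixed_precision="bf16")
    model, optimizer, dataloader = accelerator.prepare(
        model, optimizer, dataloader
    )
    model.train()
    for _ in range(num_epochs):
        for batch in dataloader:
            outputs = model(**batch)      # causal LM batch including labels
            accelerator.backward(outputs.loss)
            optimizer.step()
            optimizer.zero_grad()
    return model
```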
Software Dependencies: No. The paper mentions software like PyTorch, Hugging Face Transformers, Microsoft DeepSpeed, and Hugging Face Accelerate, but does not provide specific version numbers for these software components.
Experiment Setup: Yes. "We employ a batch size of 250 and a 32-epoch linear learning rate schedule with a warm-up ratio of 0.03 in all training and finetuning experiments with the exception of the ultra-low data BabyAI-Text experiment (1K demonstrations only). In this experiment, we tune models with the same batch size and learning rate schedule but for 64 epochs. We found γ = 3 × 10⁻⁴ to consistently be the best performing learning rate. We preserve the default 1024-token context length of the model in BabyAI-Text, but extend model context lengths by a factor of four up to 4096 tokens for finetuning on LangHack by introducing new positional encodings."
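For concreteness, the reported optimizer hyperparameters (peak learning rate 3 × 10⁻⁴, linear decay, warm-up ratio 0.03 over a 32-epoch schedule) could be wired up with Hugging Face Transformers utilities roughly as follows; the AdamW optimizer and the step accounting are assumptions, not details given in the paper:

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer_and_schedule(model, steps_per_epoch,
                                 num_epochs=32, peak_lr=3e-4,
                                 warmup_ratio=0.03):
    """Linear learning-rate schedule with warm-up matching the reported
    setup: peak LR 3e-4, warm-up ratio 0.03, length = epochs * steps."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
    total_steps = steps_per_epoch * num_epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```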