diff History for Neural Language Agents

Authors: Ulyana Piterbarg, Lerrel Pinto, Rob Fergus

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility variables, results, and supporting LLM responses:
Research Type: Experimental. "We conduct a series of experiments testing instruction finetuning with diff history in two different environments: the low-dimensional, multi-task BabyAI-Text environment (Carta et al., 2023) and the high-dimensional, as-of-yet unsolved video game NetHack (Küttler et al., 2020)."
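The technique named in the title, diff history, refers to replacing repeated full-text observations in an agent's prompt with the textual deltas between consecutive observations. As a minimal sketch of that idea, using Python's standard difflib rather than the authors' released code, with an illustrative function name and example observations:

```python
import difflib

def diff_history(observations):
    """Keep the first observation in full, then represent every later
    observation as a unified diff against the one before it."""
    if not observations:
        return []
    history = [observations[0]]
    for prev, curr in zip(observations, observations[1:]):
        delta = difflib.unified_diff(
            prev.splitlines(), curr.splitlines(), lineterm=""
        )
        # Drop the "---"/"+++" file headers; keep only the hunk text.
        history.append("\n".join(
            line for line in delta if not line.startswith(("---", "+++"))
        ))
    return history

# Hypothetical pair of consecutive BabyAI-Text-style observations.
obs = [
    "You see a red ball 2 steps ahead.\nYou carry nothing.",
    "You see a red ball 1 step ahead.\nYou carry nothing.",
]
print(diff_history(obs)[1])  # prints the "@@ ... @@" hunk with -/+ lines
```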
Researcher Affiliation: Academia. "Dept. of Computer Science, Courant Institute, NYU. Correspondence to: Ulyana Piterbarg <up2021@cims.nyu.edu>."
Pseudocode: No. No structured pseudocode or algorithm blocks were found.
Open Source Code: Yes. "We open-source our code and data to https://diffhistory.github.io."
Open Datasets: Yes. "We open-source our code and data to https://diffhistory.github.io."
Dataset Splits: Yes. "Models were tuned with action prediction losses for a single epoch on an 80/20 split of each tuning dataset and were validated for perplexity on the held-out portion."
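The split is described only at this level of detail; a simple way to produce such an 80/20 partition, assuming each tuning dataset is an in-memory list of examples and that the shuffling seed is unspecified in the paper, would be:

```python
import random

def train_validation_split(examples, held_out_frac=0.2, seed=0):
    """Shuffle examples and hold out a fraction for perplexity validation."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_held_out = int(len(shuffled) * held_out_frac)
    return shuffled[n_held_out:], shuffled[:n_held_out]

train_set, validation_set = train_validation_split(range(1000))
print(len(train_set), len(validation_set))  # 800 200
```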
Hardware Specification: Yes. "Finetuning is conducted in mixed precision with Microsoft DeepSpeed (Wang et al., 2023a) and Hugging Face Accelerate (Gugger et al., 2022) on NVIDIA A100 GPU nodes. All neural LMs are evaluated on single NVIDIA RTX 8000, A100, or A4000 GPUs on an academic high-performance computing cluster."
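The launch configuration itself is not quoted here; a minimal mixed-precision training loop with Hugging Face Accelerate might look like the sketch below. The model, optimizer, and dataloader are placeholders, and DeepSpeed would typically be enabled separately (for example through the Accelerate launch configuration or a DeepSpeedPlugin) rather than in this code.

```python
from accelerate import Accelerator

def finetune(model, optimizer, dataloader, num_epochs=1):
    # bf16 mixed precision; DeepSpeed/ZeRO settings would normally be
    # supplied via the Accelerate launch configuration, not in code.
    accelerator = Accelerator(mixed_precision="bf16")
    model, optimizer, dataloader = accelerator.prepare(
        model, optimizer, dataloader
    )
    model.train()
    for _ in range(num_epochs):
        for batch in dataloader:
            outputs = model(**batch)      # causal LM batch including labels
            accelerator.backward(outputs.loss)
            optimizer.step()
            optimizer.zero_grad()
    return model
```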
Software Dependencies: No. The paper mentions software like PyTorch, Hugging Face Transformers, Microsoft DeepSpeed, and Hugging Face Accelerate, but does not provide specific version numbers for these software components.
Experiment Setup: Yes. "We employ a batch size of 250 and a 32-epoch linear learning rate schedule with a warm-up ratio of 0.03 in all training and finetuning experiments with the exception of the ultra-low data BabyAI-Text experiment (1K demonstrations only). In this experiment, we tune models with the same batch size and learning rate schedule but for 64 epochs. We found γ = 3 × 10⁻⁴ to consistently be the best performing learning rate. We preserve the default 1024-token context length of the model in BabyAI-Text, but extend model context lengths by a factor of four up to 4096 tokens for finetuning on LangHack by introducing new positional encodings."
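For concreteness, the reported optimizer hyperparameters (peak learning rate 3 × 10⁻⁴, linear decay, warm-up ratio 0.03 over a 32-epoch schedule) could be wired up with Hugging Face Transformers utilities roughly as follows; the AdamW optimizer and the step accounting are assumptions, not details given in the paper:

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer_and_schedule(model, steps_per_epoch,
                                 num_epochs=32, peak_lr=3e-4,
                                 warmup_ratio=0.03):
    """Linear learning-rate schedule with warm-up matching the reported
    setup: peak LR 3e-4, warm-up ratio 0.03, length = epochs * steps."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
    total_steps = steps_per_epoch * num_epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```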