FactorSim: Generative Simulation via Factorized Representation
Authors: Fan-Yun Sun, Harini S I, Angela Yi, Yihan Zhou, Alex Zook, Jonathan Tremblay, Logan Cross, Jiajun Wu, Nick Haber
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For evaluation, we introduce a generative simulation benchmark that assesses the generated simulation code's accuracy and effectiveness in facilitating zero-shot transfers in reinforcement learning settings. We show that FACTORSIM outperforms existing methods in generating simulations regarding prompt alignment (i.e., accuracy), zero-shot transfer abilities, and human evaluation. We also demonstrate its effectiveness in generating robotic tasks. |
| Researcher Affiliation | Collaboration | Fan-Yun Sun (Stanford University), S. I. Harini (Stanford University), Angela Yi (Stanford University), Yihan Zhou (Stanford University), Alex Zook (NVIDIA), Jonathan Tremblay (NVIDIA), Logan Cross (Stanford University), Jiajun Wu (Stanford University), Nick Haber (Stanford University) |
| Pseudocode | Yes | Algorithm 1: FACTORSIM. Input: Q_text, a natural language description of the simulation, and an LLM. Output: a Turing-computable simulation represented as a POMDP M = (S, A, O, T, Ω, R) |
| Open Source Code | Yes | We provide code in the supplementary material. |
| Open Datasets | Yes | For each RL game, the input prompt consists of the game's online documentation. Since most game documentation is incomplete, we manually supplement them with additional details (see Appendix). This ensures that our method and the baselines do not hallucinate any missing game information, allowing for a fair evaluation across all methods. |
| Dataset Splits | No | The paper does not explicitly mention a separate validation set or how it was used. It states that the 'best zero-shot performance on the testing environment' was reported, but this does not clarify whether a distinct validation phase was used. |
| Hardware Specification | Yes | All experiments are done on a workstation with 8 Nvidia A40 GPUs and 1008 GB of RAM. |
| Software Dependencies | No | The paper mentions 'RLLib [20]' and specific LLM models: 'GPT-4 refers to OpenAI's gpt-4-1106-preview model, GPT-3.5 refers to OpenAI's gpt-3.5-turbo model, and Llama-3 refers to the open-sourced meta-llama-3-70b-instruct model'. However, it does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA, nor for RLlib itself, which are necessary for full reproducibility. |
| Experiment Setup | Yes | The PPO agent is trained with a batch size of 10,000, and an SGD minibatch size of 2048. Our agent used a fully connected network with hidden layers of sizes (4, 4) and post-FCNet hidden layers of size 16, all employing ReLU activation functions. The policy network uses an LSTM with a cell size of 64 to incorporate previous actions but not previous rewards. |
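
The Pseudocode row above quotes Algorithm 1, whose output is a simulation represented as a POMDP tuple M = (S, A, O, T, Ω, R). The sketch below illustrates one way such a factored representation could be held in Python; the class, field, and function names are illustrative assumptions and not the authors' released code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

# Hypothetical container for the POMDP tuple M = (S, A, O, T, Omega, R)
# produced by a FactorSim-style generation pipeline. Names are illustrative;
# the code in the paper's supplementary material may structure this differently.
@dataclass
class GeneratedPOMDP:
    states: Sequence[Any]                      # S: state space (or state description)
    actions: Sequence[Any]                     # A: action space
    observations: Sequence[Any]                # O: observation space
    transition: Callable[[Any, Any], Any]      # T(s, a) -> next state
    observation_fn: Callable[[Any, Any], Any]  # Omega(s', a) -> observation
    reward: Callable[[Any, Any, Any], float]   # R(s, a, s') -> scalar reward

def step(pomdp: GeneratedPOMDP, state, action):
    """Advance the simulation one step using the factored components."""
    next_state = pomdp.transition(state, action)
    obs = pomdp.observation_fn(next_state, action)
    rew = pomdp.reward(state, action, next_state)
    return next_state, obs, rew
```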
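
The Experiment Setup row maps naturally onto an RLlib PPO configuration. The sketch below reconstructs the reported hyperparameters under the assumption of RLlib 2.x's `PPOConfig` API; since the paper does not pin an RLlib version (see the Software Dependencies row), exact key names may differ, and the environment name is a placeholder rather than one of the paper's generated games.

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Sketch of the reported PPO setup, assuming RLlib 2.x's PPOConfig API.
config = (
    PPOConfig()
    .environment("CartPole-v1")  # placeholder; the paper trains on generated game environments
    .training(
        train_batch_size=10_000,   # "batch size of 10,000"
        sgd_minibatch_size=2048,   # "SGD minibatch size of 2048"
        model={
            "fcnet_hiddens": [4, 4],        # fully connected hidden layers of sizes (4, 4)
            "fcnet_activation": "relu",     # ReLU activations
            "post_fcnet_hiddens": [16],     # post-FCNet hidden layer of size 16
            "post_fcnet_activation": "relu",
            "use_lstm": True,               # policy network uses an LSTM
            "lstm_cell_size": 64,           # cell size 64
            "lstm_use_prev_action": True,   # incorporates previous actions...
            "lstm_use_prev_reward": False,  # ...but not previous rewards
        },
    )
)

algo = config.build()
result = algo.train()  # one training iteration
```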