FactorSim: Generative Simulation via Factorized Representation
Authors: Fan-Yun Sun, Harini S I, Angela Yi, Yihan Zhou, Alex Zook, Jonathan Tremblay, Logan Cross, Jiajun Wu, Nick Haber
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For evaluation, we introduce a generative simulation benchmark that assesses the generated simulation code's accuracy and effectiveness in facilitating zero-shot transfers in reinforcement learning settings. We show that FACTORSIM outperforms existing methods in generating simulations regarding prompt alignment (i.e., accuracy), zero-shot transfer abilities, and human evaluation. We also demonstrate its effectiveness in generating robotic tasks. |
| Researcher Affiliation | Collaboration | Fan-Yun Sun (Stanford University), S. I. Harini (Stanford University), Angela Yi (Stanford University), Yihan Zhou (Stanford University), Alex Zook (NVIDIA), Jonathan Tremblay (NVIDIA), Logan Cross (Stanford University), Jiajun Wu (Stanford University), Nick Haber (Stanford University) |
| Pseudocode | Yes | Algorithm 1: FACTORSIM. Input: Q_text, a natural language description of the simulation, and an LLM. Output: a Turing-computable simulation represented as a POMDP M = (S, A, O, T, Ω, R) |
| Open Source Code | Yes | We provide code in the supplementary material. |
| Open Datasets | Yes | For each RL game, the input prompt consists of the game's online documentation. Since most game documentation is incomplete, we manually supplement them with additional details (see Appendix). This ensures that our method and the baselines do not hallucinate any missing game information, allowing for a fair evaluation across all methods. |
| Dataset Splits | No | The paper does not explicitly mention a separate validation set or how it was used. It states that the 'best zero-shot performance on the testing environment' was reported, but this does not clarify whether a distinct validation phase was used. |
| Hardware Specification | Yes | All experiments are done on a workstation with 8 Nvidia A40 GPUs and 1008 GB of RAM. |
| Software Dependencies | No | The paper mentions 'RLLib [20]' and specific LLM models: 'GPT-4 refers to OpenAI's gpt-4-1106-preview model, GPT-3.5 refers to OpenAI's gpt-3.5-turbo model, and Llama-3 refers to the open-sourced meta-llama-3-70b-instruct model'. However, it does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA, nor for RLlib itself, which are necessary for full reproducibility. |
| Experiment Setup | Yes | The PPO agent is trained with a batch size of 10,000, and an SGD minibatch size of 2048. Our agent used a fully connected network with hidden layers of sizes (4, 4) and post-FCNet hidden layers of size 16, all employing ReLU activation functions. The policy network uses an LSTM with a cell size of 64 to incorporate previous actions but not previous rewards. |
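
The Pseudocode row above quotes Algorithm 1, whose output is a simulation represented as a POMDP tuple M = (S, A, O, T, Ω, R). The sketch below illustrates one way such a factored representation could be held in Python; the class, field, and function names are illustrative assumptions and not the authors' released code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

# Hypothetical container for the POMDP tuple M = (S, A, O, T, Omega, R)
# produced by a FactorSim-style generation pipeline. Names are illustrative;
# the code in the paper's supplementary material may structure this differently.
@dataclass
class GeneratedPOMDP:
    states: Sequence[Any]                      # S: state space (or state description)
    actions: Sequence[Any]                     # A: action space
    observations: Sequence[Any]                # O: observation space
    transition: Callable[[Any, Any], Any]      # T(s, a) -> next state
    observation_fn: Callable[[Any, Any], Any]  # Omega(s', a) -> observation
    reward: Callable[[Any, Any, Any], float]   # R(s, a, s') -> scalar reward

def step(pomdp: GeneratedPOMDP, state, action):
    """Advance the simulation one step using the factored components."""
    next_state = pomdp.transition(state, action)
    obs = pomdp.observation_fn(next_state, action)
    rew = pomdp.reward(state, action, next_state)
    return next_state, obs, rew
```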
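
The Experiment Setup row maps naturally onto an RLlib PPO configuration. The sketch below reconstructs the reported hyperparameters under the assumption of RLlib 2.x's `PPOConfig` API; since the paper does not pin an RLlib version (see the Software Dependencies row), exact key names may differ, and the environment name is a placeholder rather than one of the paper's generated games.

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Sketch of the reported PPO setup, assuming RLlib 2.x's PPOConfig API.
config = (
    PPOConfig()
    .environment("CartPole-v1")  # placeholder; the paper trains on generated game environments
    .training(
        train_batch_size=10_000,   # "batch size of 10,000"
        sgd_minibatch_size=2048,   # "SGD minibatch size of 2048"
        model={
            "fcnet_hiddens": [4, 4],        # fully connected hidden layers of sizes (4, 4)
            "fcnet_activation": "relu",     # ReLU activations
            "post_fcnet_hiddens": [16],     # post-FCNet hidden layer of size 16
            "post_fcnet_activation": "relu",
            "use_lstm": True,               # policy network uses an LSTM
            "lstm_cell_size": 64,           # cell size 64
            "lstm_use_prev_action": True,   # incorporates previous actions...
            "lstm_use_prev_reward": False,  # ...but not previous rewards
        },
    )
)

algo = config.build()
result = algo.train()  # one training iteration
```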