ACES: Generating a Diversity of Challenging Programming Puzzles with Autotelic Generative Models

Authors: Julien Pourcel, Cédric Colas, Gaia Molinaro, Pierre-Yves Oudeyer, Laetitia Teodorescu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We propose a method that aims to automate this process by harnessing the power of state-of-the-art generative models to produce a diversity of challenging yet solvable problems, here in the context of Python programming puzzles. [...] ACES generates problems that are more diverse and more challenging than problems produced by baseline methods and three times more challenging than problems found in existing Python programming benchmarks on average across 11 state-of-the-art code LLMs."
Researcher Affiliation | Collaboration | Julien Pourcel (Inria), Cédric Colas (MIT, Inria), Gaia Molinaro (University of California, Berkeley), Pierre-Yves Oudeyer (Inria), Laetitia Teodorescu (Inria)
Pseudocode | No | The paper includes an "Overview of the ACES algorithm" flowchart in Figure 1, but no formal pseudocode or algorithm block.
Open Source Code | Yes | "While it should be possible to reproduce our data generation process from this information alone, the code is available at https://github.com/Julien-pour/OpenELM/tree/imgep-qdaif."
Open Datasets | Yes | "The Python Programming Puzzles dataset (P3) contains 1715 puzzle-solution pairs [...] The P3 dataset is split into training and testing datasets (N = 636 and 1079, respectively)."
Dataset Splits | No | The paper explicitly states train and test splits for the P3 dataset, but does not mention a distinct validation split.
Hardware Specification | Yes | "Each experiment was performed on 1 node of 4 Nvidia Tesla V100 SXM2 32 GB, with 160 GB RAM, for about 20 hours using the vLLM library [Kwon et al., 2023]."
Software Dependencies | Yes | "Puzzle generation, solution generation, description generation, and puzzle labeling are all implemented with the state-of-the-art open source model Llama 3 70B, quantized in 4 bits, with a temperature parameter of 0.8. [...] using the vLLM library [Kwon et al., 2023]."
Experiment Setup | Yes | "Puzzle generation, solution generation, description generation, and puzzle labeling are all implemented with the state-of-the-art open source model Llama 3 70B, quantized in 4 bits, with a temperature parameter of 0.8. [...] We repeat all experiments using 3 different random seeds and report the means and standard deviations of the results. [...] Each experiment is run for 40 generations, where each generation corresponds to 160 puzzles generated by the puzzle generator, for a total of 6400 puzzle generation attempts per run."
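For context on the P3 dataset referenced above: P3 puzzles follow a fixed convention in which a puzzle is a Python function `f` returning a bool, and a solution is a function `g` whose output makes `f` evaluate to `True`. A minimal illustrative pair (the puzzle below is hypothetical, not taken from the dataset):

```python
# Illustrative P3-style puzzle-solution pair (hypothetical example).
# Convention: f verifies a candidate answer; g produces one; the pair
# counts as solved when f(g()) is True.
def f(s: str) -> bool:
    """Puzzle: find a string of length 5 containing only 'a' characters."""
    return len(s) == 5 and set(s) == {"a"}

def g() -> str:
    """Candidate solution to the puzzle above."""
    return "a" * 5

assert f(g())  # the pair is valid: the solution satisfies the puzzle
```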
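The run budget quoted in the setup can be checked with quick arithmetic (the constant names below are ours, not from the authors' code):

```python
# Budget of one ACES run as reported: 40 generations, each producing
# 160 puzzle generation attempts, repeated over 3 random seeds.
N_GENERATIONS = 40
ATTEMPTS_PER_GENERATION = 160
N_SEEDS = 3

attempts_per_run = N_GENERATIONS * ATTEMPTS_PER_GENERATION
print(attempts_per_run)  # 6400, matching the paper's per-run total

total_attempts = attempts_per_run * N_SEEDS
print(total_attempts)  # 19200 attempts across all seeds
```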