Causal language modeling can elicit search and reasoning capabilities on logic puzzles

Authors: Kulin Shah, Nishanth Dikkala, Xin Wang, Rina Panigrahy

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we study if causal language modeling can learn a complex task such as solving Sudoku puzzles. We observe that Transformer models trained on this synthetic task can indeed learn to solve Sudokus (our model solves 94.21% of the puzzles fully correctly) when trained on a logical sequence of steps taken by a solver. We find that training Transformers with the logical sequence of steps is necessary and that, without such training, they fail to learn Sudoku. We also extend our analysis to Zebra puzzles (known as Einstein puzzles) and show that the model solves 92.04% of the puzzles fully correctly. (A hypothetical serialization of such solver steps is sketched after the table.)
Researcher Affiliation | Collaboration | Kulin Shah (UT Austin, kulinshah@utexas.edu); Nishanth Dikkala (Google Research, nishanthd@google.com); Xin Wang (Google Research, wanxin@google.com); Rina Panigrahy (Google Research, rinap@google.com)
Pseudocode | No | The paper describes algorithms and procedures in prose, but it does not contain clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | The code is available at https://github.com/kulinshah98/llm-reasoning-logic-puzzles
Open Datasets | Yes | We consider a dataset of Sudoku puzzles of varying difficulty levels from [Rad20].
Dataset Splits | Yes | Our training dataset for the Sudoku experiment contains 1.8M puzzles and the test dataset contains 0.1M puzzles. We randomly choose 0.1M puzzles from these puzzles and use them as a validation dataset for the evaluation of the model, and the remaining 1.8M puzzles are part of our training dataset. (A minimal split sketch follows the table.)
Hardware Specification | No | The paper does not specify the type of compute workers (e.g., GPU model, CPU, TPU) used for the experiments. It only mentions a 'Transformer-based GPT-2 architecture'.
Software Dependencies | No | The paper mentions 'Transformer-based GPT-2' and the 'AdamW optimizer' but does not provide specific version numbers for software libraries, frameworks, or programming languages used (e.g., Python, PyTorch/TensorFlow, CUDA versions).
Experiment Setup | Yes | We use the AdamW optimizer for our experiments. For all the experiments, the learning rate is set to 1e-4 and models are trained for 4 million steps with a batch size of 64. We use the cosine learning rate schedule [LH16] with the first 4000 tokens as the warmup phase and an end learning rate factor of 0.2. (A configuration sketch follows the table.)
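
The rows above leave the training format unspecified beyond "a logical sequence of steps taken by a solver." Purely as an illustration, the sketch below serializes each solver step as a (row, column, value) triple and flattens the steps into one sequence for causal language modeling; the function name and token layout are hypothetical and are not the authors' actual pipeline.

```python
# Hypothetical sketch only: one way to turn an ordered list of solver steps
# into a flat sequence for causal language modeling. The (row, col, value)
# layout is an assumption, not the serialization used in the paper.
from typing import List, Tuple

Step = Tuple[int, int, int]  # (row, col, value): rows/cols in 0..8, values in 1..9

def serialize_solver_steps(steps: List[Step]) -> List[int]:
    """Flatten solver steps into a token list, one cell decision at a time."""
    tokens: List[int] = []
    for row, col, value in steps:
        tokens.extend([row, col, value])
    return tokens

# Example: the solver fills cell (0, 2) with 5, then cell (4, 7) with 9.
print(serialize_solver_steps([(0, 2, 5), (4, 7, 9)]))  # [0, 2, 5, 4, 7, 9]
```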
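
The Dataset Splits row reports a pool of 1.9M puzzles from which 0.1M are held out at random for validation, leaving 1.8M for training (the 0.1M test set is separate). A minimal sketch of such a split, assuming the puzzles sit in a NumPy-indexable array; the variable names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# `puzzles` stands in for the 1.9M-puzzle pool; a toy index array here.
puzzles = np.arange(1_900_000)

perm = rng.permutation(len(puzzles))
val_idx, train_idx = perm[:100_000], perm[100_000:]

validation_set = puzzles[val_idx]   # 0.1M puzzles for model evaluation
training_set = puzzles[train_idx]   # remaining 1.8M puzzles for training

print(len(validation_set), len(training_set))  # 100000 1800000
```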
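
The Experiment Setup row specifies the optimization hyperparameters in full. Below is a minimal sketch of that configuration, assuming PyTorch; `model` is a placeholder, the quoted "4000 tokens" of warmup is read here as 4,000 optimizer steps, and the authors' actual training code lives in the linked repository.

```python
# Minimal sketch of the reported optimizer and schedule (AdamW, lr 1e-4,
# 4M steps, linear warmup then cosine decay to 0.2x), assuming PyTorch.
import math
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the GPT-2-style model

total_steps = 4_000_000
warmup_steps = 4_000        # quoted as "4000 tokens"; treated as steps here
base_lr = 1e-4
end_lr_factor = 0.2

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)

def lr_lambda(step: int) -> float:
    """Linear warmup followed by cosine decay down to 0.2x the base LR."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return end_lr_factor + (1.0 - end_lr_factor) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```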