Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Training a Generally Curious Agent
Authors: Fahim Tajwar, Yiding Jiang, Abitha Thankaraj, Sumaita Sadia Rahman, J Zico Kolter, Jeff Schneider, Russ Salakhutdinov
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that models fine-tuned with PAPRIKA can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. |
| Researcher Affiliation | Academia | Fahim Tajwar*¹, Yiding Jiang*¹, Abitha Thankaraj¹, Sumaita Sadia Rahman², J. Zico Kolter¹, Jeff Schneider¹, Russ Salakhutdinov¹. ¹CMU, ²North Carolina State University. Correspondence to: Fahim Tajwar <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Task selection with UCB. Input: number of arms K, number of samples C, number of rounds T, model π. Initialize: s_k = 0, n_k = 0, Buffer. For each round t = 1, 2, ..., T: compute θ_k = s_k/n_k + sqrt(2 log(Σ_{k=1}^K n_k) / n_k) for each k; select k = argmax_k θ_k; sample τ from group k; sample C trajectories from τ and add them to Buffer; compute an estimate ν̂_π(τ) using Eq. 4; update s_k = s_k + ν̂_π(τ), n_k = n_k + 1. Finally, construct D from Buffer and train the model π. |
| Open Source Code | Yes | Our codebase to reproduce the results in this paper can be found here: https://github.com/tajwarfahim/paprika |
| Open Datasets | Yes | We also release the datasets used to train our models. Our supervised fine-tuning dataset can be found here: https://huggingface.co/datasets/ftajwar/paprika_SFT_dataset. The dataset used during RPO fine-tuning can be found here: https://huggingface.co/datasets/ftajwar/paprika_preference_dataset |
| Dataset Splits | Yes | This results in 17,181 training trajectories for supervised fine-tuning and 5,260 trajectory pairs for RPO over all task groups. This results in 477 easy, 726 medium, and 296 hard topics in the train split and 127 easy, 172 medium, and 68 hard topics in the test split. See Table 1 (summary of the task groups used by PAPRIKA, with train/test task counts per group). |
| Hardware Specification | Yes | All our Llama-3.1-8B-Instruct models were trained using a single node consisting of 8 NVIDIA L40S GPUs. For training the Gemma-3-12B-IT models, we use a single node consisting of 8 NVIDIA H100 GPUs. For inference and generating data, we use single NVIDIA A40 GPUs. |
| Software Dependencies | No | The paper mentions Flash-Attention (Dao et al., 2022; Dao, 2024) but does not specify version numbers for any key software libraries or frameworks used for implementation. |
| Experiment Setup | Yes | Unless explicitly mentioned otherwise, we use a learning rate of 10⁻⁶ for supervised fine-tuning and 2 × 10⁻⁷ for RPO. We use batch size 32 for all training runs. Unless explicitly mentioned otherwise, we run supervised fine-tuning first and then further fine-tune with the RPO objective to obtain the final model. We use an AdamW optimizer (Loshchilov & Hutter, 2019) with a cosine annealing learning rate scheduler and warmup ratio 0.04 (Loshchilov & Hutter, 2017) to train all our models. For data generation, we use Min-p sampling (Nguyen et al., 2024) with temperature 1.5 and Min-p parameter 0.3, as we saw that this setting consistently generated diverse training data that resulted in higher test-time accuracy. The default temperature for evaluation is set to 0.7. |
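The UCB-based task selection quoted in the Pseudocode row can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_group` is a hypothetical callback standing in for sampling a task τ from group k, rolling out C trajectories, and computing the learning-progress estimate ν̂_π(τ) from Eq. 4 of the paper.

```python
import math

def ucb_task_selection(num_arms, num_rounds, sample_group, c_trajectories=4):
    """Sketch of UCB task selection (Algorithm 1 as quoted above).

    sample_group(k, C) is assumed to return (trajectories, nu_hat):
    C sampled trajectories from group k and the estimate of nu_hat_pi(tau).
    """
    s = [0.0] * num_arms  # cumulative reward estimate s_k per group
    n = [0] * num_arms    # visit count n_k per group
    buffer = []
    for _ in range(num_rounds):
        total = sum(n)
        # UCB score theta_k = s_k / n_k + sqrt(2 log(sum_k n_k) / n_k);
        # unvisited groups get +inf so each arm is tried at least once.
        theta = [
            s[k] / n[k] + math.sqrt(2 * math.log(total) / n[k])
            if n[k] > 0 else float("inf")
            for k in range(num_arms)
        ]
        k = max(range(num_arms), key=lambda i: theta[i])
        trajectories, nu_hat = sample_group(k, c_trajectories)
        buffer.extend(trajectories)
        s[k] += nu_hat
        n[k] += 1
    # The paper then constructs the training set D from the buffer
    # and fine-tunes the model; that step is omitted here.
    return buffer, s, n
```

With a toy reward where one group consistently yields higher ν̂, the loop concentrates its sampling budget on that group while still occasionally revisiting the others, which is the intended exploration/exploitation trade-off.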