Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Training a Generally Curious Agent
Authors: Fahim Tajwar, Yiding Jiang, Abitha Thankaraj, Sumaita Sadia Rahman, J Zico Kolter, Jeff Schneider, Russ Salakhutdinov
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that models fine-tuned with PAPRIKA can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. |
| Researcher Affiliation | Academia | Fahim Tajwar*¹, Yiding Jiang*¹, Abitha Thankaraj¹, Sumaita Sadia Rahman², J. Zico Kolter¹, Jeff Schneider¹, Russ Salakhutdinov¹. ¹CMU, ²North Carolina State University. Correspondence to: Fahim Tajwar <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Task selection with UCB. Input: number of arms K, number of samples C, number of rounds T, model π. Initialize: s_k = 0, n_k = 0, Buffer. For each round t = 1, 2, ..., T: compute θ_k = s_k/n_k + sqrt(2 log(Σ_{k=1}^K n_k) / n_k) for each k; select k = argmax_k θ_k; sample τ from group k; sample C trajectories from τ and add them to Buffer; compute an estimate ν̂_π(τ) using Eq. 4; update s_k = s_k + ν̂_π(τ), n_k = n_k + 1. Finally, construct D from Buffer and train the model π. |
| Open Source Code | Yes | Our codebase to reproduce the results in this paper can be found here: https://github.com/tajwarfahim/paprika |
| Open Datasets | Yes | We also release the datasets used to train our models. Our supervised fine-tuning dataset can be found here: https://huggingface.co/datasets/ftajwar/paprika_SFT_dataset. The dataset used during RPO fine-tuning can be found here: https://huggingface.co/datasets/ftajwar/paprika_preference_dataset |
| Dataset Splits | Yes | This results in 17,181 training trajectories for supervised fine-tuning and 5,260 trajectory pairs for RPO over all task groups. This results in 477 easy, 726 medium, and 296 hard topics in the train split and 127 easy, 172 medium, and 68 hard topics in the test split. See Table 1 (summary of the task groups used by PAPRIKA, with train/test task counts per group). |
| Hardware Specification | Yes | All our Llama-3.1-8B-Instruct models were trained using a single node consisting of 8 NVIDIA L40S GPUs. For training the Gemma-3-12B-IT models, we use a single node consisting of 8 NVIDIA H100 GPUs. For inference and generating data, we use single NVIDIA A40 GPUs. |
| Software Dependencies | No | The paper mentions Flash-Attention (Dao et al., 2022; Dao, 2024) but does not specify version numbers for any key software libraries or frameworks used for implementation. |
| Experiment Setup | Yes | Unless explicitly mentioned otherwise, we use a learning rate of 10⁻⁶ for supervised fine-tuning and 2 × 10⁻⁷ for RPO. We use batch size 32 for all training runs. Unless explicitly mentioned otherwise, we run supervised fine-tuning first and then further fine-tune with the RPO objective to obtain the final model. We use an AdamW optimizer (Loshchilov & Hutter, 2019) with a cosine annealing learning rate scheduler and warmup ratio 0.04 (Loshchilov & Hutter, 2017) to train all our models. For data generation, we use Min-p sampling (Nguyen et al., 2024) with temperature 1.5 and Min-p parameter 0.3, as we saw that this setting consistently generated diverse training data that resulted in higher test-time accuracy. The default temperature for evaluation is set to 0.7. |
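The UCB-based task selection quoted in the Pseudocode row can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_group` is a hypothetical callback standing in for sampling a task τ from group k, rolling out C trajectories, and computing the learning-progress estimate ν̂_π(τ) from Eq. 4 of the paper.

```python
import math

def ucb_task_selection(num_arms, num_rounds, sample_group, c_trajectories=4):
    """Sketch of UCB task selection (Algorithm 1 as quoted above).

    sample_group(k, C) is assumed to return (trajectories, nu_hat):
    C sampled trajectories from group k and the estimate of nu_hat_pi(tau).
    """
    s = [0.0] * num_arms  # cumulative reward estimate s_k per group
    n = [0] * num_arms    # visit count n_k per group
    buffer = []
    for _ in range(num_rounds):
        total = sum(n)
        # UCB score theta_k = s_k / n_k + sqrt(2 log(sum_k n_k) / n_k);
        # unvisited groups get +inf so each arm is tried at least once.
        theta = [
            s[k] / n[k] + math.sqrt(2 * math.log(total) / n[k])
            if n[k] > 0 else float("inf")
            for k in range(num_arms)
        ]
        k = max(range(num_arms), key=lambda i: theta[i])
        trajectories, nu_hat = sample_group(k, c_trajectories)
        buffer.extend(trajectories)
        s[k] += nu_hat
        n[k] += 1
    # The paper then constructs the training set D from the buffer
    # and fine-tunes the model; that step is omitted here.
    return buffer, s, n
```

With a toy reward where one group consistently yields higher ν̂, the loop concentrates its sampling budget on that group while still occasionally revisiting the others, which is the intended exploration/exploitation trade-off.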