Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Curriculum Design for Trajectory-Constrained Agent: Compressing Chain-of-Thought Tokens in LLMs

Authors: Georgios Tzannetos, Parameswaran Kamalaruban, Adish Singla

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Moreover, we empirically validate the effectiveness and generality of our method across both RL and large language model (LLM) agents in diverse settings, including a binary-tree MDP, a multi-task navigation domain, and a math reasoning task with two benchmarks. These results highlight the potential of curriculum design in enhancing the efficiency and performance of agents operating under complex trajectory constraints during deployment.
Researcher Affiliation Collaboration Georgios Tzannetos (MPI-SWS); Parameswaran Kamalaruban (Featurespace Innovation Lab, Visa); Adish Singla (MPI-SWS)
Pseudocode Yes Algorithm 1: Training RL Agents with Deployment-time Constraints. Algorithm 2: Teacher Component: Curriculum Strategy for Picking α_t (Line 4 of Algorithm 1).
Open Source Code Yes GitHub: https://github.com/machine-teaching-group/neurips2025-curriculum-llm-tokens
Open Datasets Yes To ensure the generality of our evaluation, we assess the performance of our algorithm on two mathematical reasoning benchmarks: SVAMP [46], and GSM8K [47].
Dataset Splits Yes Test-set Math Question: Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: John adopts a dog. He takes the dog to the groomer, which costs $100. The groomer offers him a 30% discount for being a new customer. How much does the grooming cost? ### Response: Let's think step by step. For both SVAMP [46] and GSM8K [47], we filter out prompts exceeding 512 tokens.
Hardware Specification Yes We conducted the LLM experiments on a SLURM cluster comprising nodes with eight Nvidia H100 GPUs. Using those resources, the longest LLM experiment ran for approximately three days. We conducted the RL experiments on a cluster comprising nodes equipped with Intel Xeon Gold CPUs. Table 1: Comparison of models in terms of inference metrics when deployed on various consumer hardware configurations. Results are reported for inference on the SVAMP test set with QWEN as the base model used for fine-tuning. We consider the following inference metrics: (a) Response Time, (b) Response Length, (c) Accuracy, and (d) Constr. Accuracy. Here, Accuracy denotes the overall test-set accuracy, while Constr. Accuracy corresponds to the performance metric used in Figure 4 (i.e., mean accuracy under test-time constraints). We report results for three configurations, namely, M1 (Apple M1 Pro), GTX (Nvidia GTX 1070), and RTX (Nvidia RTX 3060).
Software Dependencies No For RL experiments, we employ REINFORCE in both tabular and neural single-/multi-task settings [42]. For LLMs, we adapt Hugging Face's TRL library [43] using the RLOO trainer [44], an RL fine-tuning method equivalent to REINFORCE. This unified setup ensures consistency across all experiments. To conduct the experiments, we use Hugging Face's Accelerate and Transformers libraries. For RL fine-tuning, we employ the RLOO trainer [44] from the TRL library, training with DeepSpeed ZeRO Stage 2 [53]. The RLOO trainer is an adapted version of the PPO trainer, since REINFORCE is a special case of PPO in which the number of PPO epochs and the number of mini-batches are set to 1. This formulation is adopted in the TRL implementation. All training parameters and details are reported in Figure 11. Remaining hyperparameters are set to default values and kept consistent across all baselines. Target cost budgets for all tasks are defined as a percentage of the base model's original response length. Given the differing characteristics of models and datasets, we set the target cost budgets to 10% of the original response tokens for the QWEN model, and to 20% and 35% for the METAMATH model on SVAMP and GSM8K, respectively. Regarding the LLM deployment comparison, we evaluate inference metrics using Ollama [54] for both base and fine-tuned models across multiple hardware configurations. All models are converted to GGUF format.
Experiment Setup Yes RL experiments. We use the REINFORCE algorithm to train an MLP policy [42]. For BINARYTREE, the policy is tabular, while for PUDDLEGRID-SINGLE and PUDDLEGRID-MULTI, we use a neural policy with two hidden layers. Policies are updated using a batch size of five episodes and the Adam optimizer with a learning rate of 3e-4 across all environments. PUDDLEGRID-SINGLE and PUDDLEGRID-MULTI have a maximum of 200 steps per episode. Since PUDDLEGRID-MULTI is a multi-task environment, we generated 100 tasks by randomly positioning the agent and goal. To ensure the tasks are challenging, the agent and goal are placed on opposite sides of the lava squares. LLM experiments. As METAMATH-LLEMMA-7B serves as one of the base models in our LLM experiments, we adopt the prompt template (Figure 10) introduced by [50]. It corresponds to a default chain-of-thought prompt that encourages step-by-step reasoning. This prompting template is used across all methods during fine-tuning. Additionally, for both SVAMP [46] and GSM8K [47], we filter out prompts exceeding 512 tokens. To conduct the experiments, we use Hugging Face's Accelerate and Transformers libraries. For RL fine-tuning, we employ the RLOO trainer [44] from the TRL library, training with DeepSpeed ZeRO Stage 2 [53]. The RLOO trainer is an adapted version of the PPO trainer, since REINFORCE is a special case of PPO in which the number of PPO epochs and the number of mini-batches are set to 1. This formulation is adopted in the TRL implementation. All training parameters and details are reported in Figure 11. Figure 11: Training details and hyperparameters used across all baselines for SVAMP and GSM8K. Parameters (a single value applies to both datasets; two values are SVAMP / GSM8K): Fine-tuning Method: PEFT with LoRA; LoRA α: 96; LoRA r: 96; LoRA Dropout: 0.1; Target Modules: all linear; Per-device Train Batch Size: 4 / 2; Per-device Eval Batch Size: 4; Gradient Accumulation Steps: 8; Local Rollout Forward Batch Size: 8; Learning Rate: 3e-6; KL Coefficient: 0.0001; Max Response Length: 350 / 512.
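
The setup quoted in the table can be collected into a small Python sketch. It is illustrative only and is not taken from the authors' repository. All names (`FineTuneConfig`, `samples_per_update`, `target_budget`, `filter_prompts`) are hypothetical. The two-valued rows of Figure 11 are read as SVAMP first and GSM8K second, the number of devices is left as a parameter because the source does not state how many GPUs each run used, and rounding the budget to the nearest integer is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class FineTuneConfig:
    """Figure 11 values; defaults are the entries listed once for both datasets."""
    per_device_train_batch_size: int
    max_response_length: int
    per_device_eval_batch_size: int = 4
    gradient_accumulation_steps: int = 8
    local_rollout_forward_batch_size: int = 8
    learning_rate: float = 3e-6
    kl_coef: float = 1e-4
    lora_r: int = 96
    lora_alpha: int = 96
    lora_dropout: float = 0.1


# Assumption: two-valued rows in Figure 11 are SVAMP first, GSM8K second.
SVAMP = FineTuneConfig(per_device_train_batch_size=4, max_response_length=350)
GSM8K = FineTuneConfig(per_device_train_batch_size=2, max_response_length=512)

# Target cost budgets as a fraction of the base model's original response length.
BUDGET_FRACTION = {
    ("QWEN", "SVAMP"): 0.10,
    ("QWEN", "GSM8K"): 0.10,
    ("METAMATH", "SVAMP"): 0.20,
    ("METAMATH", "GSM8K"): 0.35,
}


def samples_per_update(cfg: FineTuneConfig, num_devices: int) -> int:
    """Usual data-parallel accounting (an assumption, not stated in the source)."""
    return cfg.per_device_train_batch_size * cfg.gradient_accumulation_steps * num_devices


def target_budget(model: str, dataset: str, base_response_tokens: float) -> int:
    """Token budget for a model/dataset pair; nearest-integer rounding is an assumption."""
    return round(BUDGET_FRACTION[(model, dataset)] * base_response_tokens)


def filter_prompts(
    prompts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[str]:
    """Keep only prompts whose token count does not exceed max_tokens."""
    return [p for p in prompts if count_tokens(p) <= max_tokens]


if __name__ == "__main__":
    # Illustrative inputs only; these are not measurements from the paper.
    print(samples_per_update(SVAMP, num_devices=8))
    print(target_budget("METAMATH", "GSM8K", base_response_tokens=200))
    # Whitespace splitting stands in for the real model tokenizer.
    print(filter_prompts(["a short prompt"], lambda p: len(p.split())))
```

In practice `count_tokens` would wrap the tokenizer of the base model being fine-tuned. The whitespace split above is only a stand-in so that the sketch runs without external dependencies.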