Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Crucible: Quantifying the Potential of Control Algorithms through LLM Agents

Authors: Lianchen Jia, Chaoyang Li, Houde Qian, Tianchi Huang, Jiangchuan Liu, Lifeng Sun

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate Crucible's effectiveness across a wide spectrum of case studies, from classic control tasks to complex computer systems, and validate its findings in a real-world deployment. Our experimental results reveal that Crucible systematically quantifies the tunable space across different algorithms. Furthermore, Crucible provides a new dimension for algorithm analysis and design, which ultimately leads to performance improvements.
Researcher Affiliation	Academia	Lianchen Jia1, Chaoyang Li1, Houde Qian1, Tianchi Huang1, Jiangchuan Liu2, Lifeng Sun1,3 1Department of Computer Science and Technology, Tsinghua University, 2Simon Fraser University, 3BNRist
Pseudocode	Yes	Appendix B provides a detailed pseudocode of the complete Crucible workflow.
Open Source Code	Yes	Our code is available at https://github.com/thu-media/Crucible.
Open Datasets	Yes	We utilize a widely adopted adaptive bitrate (ABR) simulator [32, 29] with four public network datasets: Oboe [30], FCC [49], 3G [50], and Puffer [51]. The experiments evaluate prominent algorithms, including BBA, MPC, HYB, BOLA, and Pitree, using the Envivio video trace [32] as standardized content. We use the CartPole-v1 environment from Gym [45], a standard benchmark in control task. For the input workload, we use 2GB of data consisting of 10 TPC-H standard query tasks [54].
Dataset Splits	No	The paper describes testing on various 'test environments' and 'public network datasets' (Oboe, FCC, 3G, Puffer) and 'TPC-H standard query tasks'. However, it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts) for these environments or any data used by the LLM agent to learn or tune algorithms.
Hardware Specification	No	This research mainly employs API calls to the Claude 3.7 Sonnet [44], with Bayesian optimization serving as a hyperparameter tuning tool. This indicates the use of an API service, not specific local hardware (CPU, GPU models, memory) used for running the experiments or the Bayesian optimization.
Software Dependencies	No	This research mainly employs API calls to the Claude 3.7 Sonnet [44], with Bayesian optimization serving as a hyperparameter tuning tool. The paper mentions using 'Claude 3.7 Sonnet' (an API) and 'Bayesian optimization' (a technique), but does not provide specific version numbers for any software libraries or tools used for the Bayesian optimization or other parts of the system beyond the LLM API name.
Experiment Setup	Yes	This research mainly employs API calls to the Claude 3.7 Sonnet [44], with Bayesian optimization serving as a hyperparameter tuning tool. For simulating developers with varying optimization capabilities, we configure the Bayesian iteration count at three distinct levels (0, 10, and 20 iterations) while setting the reflection iteration steps to 1, 2, and 3, respectively. The video is 193 seconds in length and segmented into 4-second chunks with bitrates of {300, 750, 1200, 1850, 2850, 4300} kbps. Specifically, in our implementation, we set µ to 4.3.
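
The experiment setup quoted above can be written down as a small configuration sketch. This is illustrative only and is not taken from the Crucible repository: the names `DeveloperLevel`, `DEVELOPER_LEVELS`, `chunk_count`, and `nominal_chunk_kilobits` are hypothetical. It assumes the three Bayesian iteration counts pair one-to-one with the three reflection step counts (as "respectively" suggests) and that the final chunk of the 193-second video is a shorter partial chunk. The parameter µ = 4.3 is omitted because its role is not defined in this excerpt.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DeveloperLevel:
    """One simulated developer capability level (hypothetical name)."""
    bayes_iterations: int   # Bayesian optimization iterations
    reflection_steps: int   # LLM reflection iteration steps


# Assumption: the levels are paired in order, i.e. (0, 1), (10, 2), (20, 3).
DEVELOPER_LEVELS = [
    DeveloperLevel(b, r) for b, r in zip((0, 10, 20), (1, 2, 3))
]

# Video parameters as stated in the experiment setup.
VIDEO_LENGTH_S = 193
CHUNK_LENGTH_S = 4
BITRATES_KBPS = (300, 750, 1200, 1850, 2850, 4300)


def chunk_count(video_length_s: int, chunk_length_s: int) -> int:
    """Number of chunks, counting a trailing partial chunk as one chunk."""
    return math.ceil(video_length_s / chunk_length_s)


def nominal_chunk_kilobits(bitrate_kbps: int, chunk_length_s: int) -> int:
    """Nominal size of a full chunk in kilobits (ignores encoding variance)."""
    return bitrate_kbps * chunk_length_s


if __name__ == "__main__":
    for level in DEVELOPER_LEVELS:
        print(level)
    print("chunks:", chunk_count(VIDEO_LENGTH_S, CHUNK_LENGTH_S))
    for rate in BITRATES_KBPS:
        print(rate, "kbps ->", nominal_chunk_kilobits(rate, CHUNK_LENGTH_S), "kbit per chunk")
```

Under these assumptions, the 193-second video yields 48 full 4-second chunks plus one 1-second remainder. A simulator that drops the remainder would report 48 chunks instead.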