Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SocioDojo: Building Lifelong Analytical Agents with Real-world Text and Time Series

Authors: Junyan Cheng, Peter Chin

ICLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We perform experiments and ablation studies to explore the factors that impact performance. The results show that our proposed method achieves improvements of 32.4% and 30.4% compared to the state-of-the-art method in the two experimental settings. [...] We also perform experiments and ablation studies to explore factors that impact performance. [...] 5 EXPERIMENT In Section 5.1, we present our experimental setup. We then evaluate our proposed H&P prompting and compare it with other state-of-the-art prompting techniques in Section 5.2. Finally, we discuss the results of the ablation studies in Section 5.3.
Researcher Affiliation	Academia	Junyan Cheng Thayer School of Engineering Dartmouth College Hanover, NH 03755, USA EMAIL Peter Chin Thayer School of Engineering Dartmouth College Hanover, NH 03755, USA EMAIL
Pseudocode	No	The paper describes the agent's architecture and prompting process in detail but does not include any sections explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code	Yes	Our code and data are available at https://github.com/chengjunyan1/SocioDojo.
Open Datasets	Yes	Our code and data are available at https://github.com/chengjunyan1/SocioDojo. [...] SocioDojo uses three components (Information Sources, Time series, and Knowledge base & Tools) based on 30 GB of high-quality real-world data that we have collected [...] We collect time series on a variety of topics for a comprehensive probe of the world state, including financial data from Yahoo Finance, economic time series from the St. Louis Federal Reserve Economic Data Database (FRED) with a popularity rating of more than 50%, Google trends of free trending keywords from Exploding Topics, a society trend tracking service, political polls from FiveThirtyEight, a famous political analysis website, and public opinion poll trackers from YouGov, an online survey platform.
Dataset Splits	No	The paper describes a lifelong learning environment where agents are 'constantly updated with the latest messages' and evaluated based on performance 'over time' (e.g., 'The game begins on 2021-10-01 and ends on 2023-08-01'). This indicates a continuous, time-based evaluation rather than traditional, fixed training/validation/test dataset splits.
Hardware Specification	No	The paper does not specify the hardware (e.g., specific GPU or CPU models, memory, or cloud instances) used to run the experiments.
Software Dependencies	No	The paper mentions 'Chroma DB' and 'Instructor-XL (Su et al., 2023)' as components and 'GPT-3.5-Turbo series' and 'GPT-4' as foundation models. However, it does not provide specific version numbers for these software dependencies (e.g., Chroma DB vX.Y.Z or Instructor-XL vA.B).
Experiment Setup	Yes	In our experiment, we set this number [max steps for analyst] to 4. [...] It can call one of the query interfaces to find resources at each step. The query is handled iteratively through a multi-round dialog response loop with a max step of 3 in our experiment. [...] It initiates a multi-round dialog action loop with a max step of 5 in our experiment when an analysis report is received. [...] All methods use AAA architecture, one news channel as the information source, GPT-3.5-Turbo series as foundation models with a low temperature=0.2 for a more deterministic experiment result. [...] We chose a bound of 5, as it is possible to achieve a return of 5 times for a single asset without leverage in around 2 years, which is the time span of SocioDojo. [...] Therefore, we selected overnight rates of FRD: 0.10, WEB: 0.05, and a bound of 5 as our experimental setting.
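
The settings quoted in the Experiment Setup row can be collected into a single configuration object, which may help when checking a rerun against the reported values. The sketch below is illustrative only: the class name `ExperimentSetup` and all field names are hypothetical and are not taken from the SocioDojo repository. Only the numeric values come from the rows above, and the start and end dates come from the Dataset Splits row.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class ExperimentSetup:
    """Hypothetical container for the values quoted in the table above.

    Field names are assumptions made for illustration; they do not
    mirror identifiers in the authors' code.
    """

    analyst_max_steps: int = 4      # max steps for the analyst
    query_loop_max_steps: int = 3   # multi-round dialog response loop for queries
    action_loop_max_steps: int = 5  # multi-round dialog action loop after a report
    temperature: float = 0.2        # low temperature for more deterministic results
    return_bound: float = 5.0       # bound of 5 chosen in the paper
    overnight_rates: dict = field(
        default_factory=lambda: {"FRD": 0.10, "WEB": 0.05}
    )
    game_start: date = date(2021, 10, 1)
    game_end: date = date(2023, 8, 1)

    def span_days(self) -> int:
        """Length of the evaluation period in days."""
        return (self.game_end - self.game_start).days

    def span_years(self) -> float:
        """Approximate length of the evaluation period in years."""
        return self.span_days() / 365.25


setup = ExperimentSetup()
print(setup.span_days(), round(setup.span_years(), 2))
```

The computed span is a little under two years, which is consistent with the "around 2 years" quoted as the rationale for the bound of 5.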