Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Searching for Programmatic Policies in Semantic Spaces

Authors: Rubens O. Moraes, Levi H. S. Lelis

IJCAI 2024 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluated our hypothesis in a real-time strategy game called MicroRTS. Empirical results support our hypothesis that searching in semantic spaces can be more sample-efficient than searching in syntax-based spaces.
Researcher Affiliation	Academia	Rubens O. Moraes (1) and Levi H. S. Lelis (2, 3). (1) Departamento de Informática, Universidade Federal de Viçosa, Brazil; (2) Department of Computing Science, University of Alberta, Canada; (3) Alberta Machine Intelligence Institute (Amii). EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 Library Construction
Open Source Code	Yes	The implementation of our system is available online at https://github.com/rubensolv/Library-Induced-Semantic-Spaces
Open Datasets	Yes	We evaluate LISS using the MicroRTS domain, a real-time strategy game designed for research. There is an active research community that uses MicroRTS as a benchmark to evaluate intelligent systems. (Footnote 2: https://github.com/Farama-Foundation/MicroRTS/wiki)
Dataset Splits	No	No specific numerical splits (e.g., percentages, exact counts) for training, validation, and testing datasets were provided. The paper refers to different MDPs (P_train, P_test) rather than data splits of a single dataset.
Hardware Specification	Yes	We used a dedicated number of computers with the following settings: 16 GB of RAM, i7-1165G7 CPUs at 2.80 GHz with 8 threads.
Software Dependencies	No	The paper mentions 'Microlanguage' and 'Python' but does not specify version numbers for programming languages, libraries, or frameworks used in the experiments.
Experiment Setup	Yes	We use k = 1000 in N_k and a limit of 400 seconds for SHC to return a best response; once it reaches this time limit, it returns the best policy it encountered across all restarts of the search. We use ϵ = 0.20 for mixing the syntax and semantic spaces. We also use z = 4 in all our experiments.
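
The Experiment Setup row describes a time-limited local search with restarts that returns the best policy seen. A minimal Python sketch of that control flow follows. It is not the authors' implementation. All function names and the toy integer domain are hypothetical. The sketch assumes that ϵ is the probability of sampling neighbors from the syntax space (the row only says ϵ mixes the two spaces) and that the search takes the first improving neighbor. The parameter z = 4 is not modeled.

```python
import random
import time


def time_limited_shc(random_candidate, syntax_neighbors, semantic_neighbors,
                     evaluate, k=1000, epsilon=0.20, time_limit=400.0,
                     rng=None, clock=time.monotonic):
    """Hill climbing with restarts under a time budget (illustrative only).

    Hypothetical callables supplied by the caller:
      random_candidate(rng)         -> fresh starting candidate for a restart
      syntax_neighbors(c, k, rng)   -> up to k neighbors of c (syntax space)
      semantic_neighbors(c, k, rng) -> up to k neighbors of c (semantic space)
      evaluate(c)                   -> score to maximize
    """
    rng = rng or random.Random()
    deadline = clock() + time_limit
    best, best_score = None, float("-inf")
    # Always complete at least one restart, then continue until the deadline.
    while best is None or clock() < deadline:
        current = random_candidate(rng)
        current_score = evaluate(current)
        improved = True
        while improved and clock() < deadline:
            improved = False
            # Assumption: epsilon is the chance of using the syntax space.
            sample = syntax_neighbors if rng.random() < epsilon else semantic_neighbors
            for candidate in sample(current, k, rng):
                score = evaluate(candidate)
                if score > current_score:
                    current, current_score = candidate, score
                    improved = True
                    break  # first-improvement move (an assumption)
        # Keep the best candidate encountered across all restarts.
        if current_score > best_score:
            best, best_score = current, current_score
    return best, best_score


# Toy domain (hypothetical): candidates are integers, the optimum is 7.
def toy_start(rng):
    return rng.randint(-50, 50)


def toy_syntax(x, k, rng):
    return [x - 1, x + 1][:k]


def toy_semantic(x, k, rng):
    return [x - 5, x + 5, x - 1, x + 1][:k]


def toy_score(x):
    return -(x - 7) ** 2


if __name__ == "__main__":
    print(time_limited_shc(toy_start, toy_syntax, toy_semantic, toy_score,
                           time_limit=0.05, rng=random.Random(0)))
```

The `clock` parameter is injectable so the budget can be measured in something other than wall-clock seconds, for example a step counter during testing.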