Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Novel Exploration via Orthogonality

Authors: Andreas Theophilou, Ozgur Simsek

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In an empirical evaluation in online, incremental settings, NEO outperformed related state-of-the-art approaches, including eigen-options and cover options, in a large collection of undirected and directed environments with varying connectivity structures.
Researcher Affiliation | Academia | Andreas Theophilou, University of Bath, United Kingdom, EMAIL; Özgür Şimşek, University of Bath, United Kingdom, EMAIL
Pseudocode | Yes | We call the proposed approach Novel Exploration via Orthogonality, or NEO, and outline it in pseudocode in Algorithm 1. Algorithm 1 NEO
Open Source Code | Yes | All code is available at https://github.com/AndreasTheo/NEO
Open Datasets | No | The paper describes various environments (e.g., directed four rooms, Rubik's cube end game, large maze, NYC street graph) used for empirical evaluation. While the code to run these environments is open-source, the paper does not explicitly provide concrete access information (link, DOI, citation) for a publicly available dataset in the form of collected data.
Dataset Splits | No | This paper describes a Reinforcement Learning setup where agents learn online in various environments. It specifies how start and goal states are chosen for different agent instances and runs: 'For each evaluation agent instance, a goal state is randomly chosen and the farthest state from the goal state is set to be the start state for all runs for the agent instance. For each method, we run 20 agent instances; we use a random seed (equal to run number) before selecting goal and start states, ensuring all start and goal states are the same across compared agents.' This describes the experimental setup and initialization for RL tasks rather than traditional training/test/validation dataset splits.
Hardware Specification | Yes | Hardware. All results are obtained using a 9900k CPU, 16GB of ram, a 256 SSD hard drive and a 2080Ti NVIDIA GPU. Results are obtained in hours on a local computer.
Software Dependencies | No | The paper does not explicitly list specific software dependencies (e.g., programming languages, libraries, or frameworks) along with their version numbers required to replicate the experiment.
Experiment Setup | Yes | We use Q-learning with step size α = 0.4, discount rate γ = 0.99, an ϵ-greedy policy with ϵ = 0.1, augmented with a growing library of temporally extended actions in the form of options. The initial Q values are set to 0 for primitive actions and 0.00001 for options... When the ϵ-greedy policy takes an exploration step, with probability init_P the agent chooses a random option, otherwise a random primitive action. We fix init_P at 0.1 for reward based evaluations... Options are discovered online every H decision stages, with H set to five times the number of nodes in the domain's full transition graph. We generate four new options per update, capping the total stored options at 64... Parameter settings: Epoch length 2,000 steps; Option initialization probability (initp) {0.1, 0.2, 0.3, 0.4, 0.5}; Update horizon (H) 5 × number of nodes in the domain; Learning rate (α) 0.4; Discount factor (γ) 0.99; Epsilon (ε) 0.1; Instances per agent 20.
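
The quoted settings in the Experiment Setup and Dataset Splits rows can be read as the following minimal Python sketch. It is not the authors' implementation (their code is at the repository listed under Open Source Code). All function names, the list-based Q-table layout, the random tie-breaking in the greedy step, and the undirected adjacency-dict graph format are assumptions made for illustration; only the numeric values and the selection rules come from the quoted text. Option discovery itself and Q-updates for options are not sketched.

```python
import random
from collections import deque

# Values quoted in the Experiment Setup row.
ALPHA, GAMMA, EPSILON = 0.4, 0.99, 0.1
INIT_P = 0.1                                    # fixed value for reward-based evaluations
Q_INIT_PRIMITIVE, Q_INIT_OPTION = 0.0, 0.00001
OPTIONS_PER_UPDATE, MAX_OPTIONS = 4, 64


def update_horizon(num_nodes):
    """H: decision stages between option-discovery updates (five times the node count)."""
    return 5 * num_nodes


def initial_q_row(n_primitive, n_options):
    """Initial Q-values for one state: primitive actions first, then options (layout is an assumption)."""
    return [Q_INIT_PRIMITIVE] * n_primitive + [Q_INIT_OPTION] * n_options


def select_action(q_row, n_primitive, rng, epsilon=EPSILON, init_p=INIT_P):
    """Epsilon-greedy choice over primitives and options; returns an index into q_row."""
    n_options = len(q_row) - n_primitive
    if rng.random() < epsilon:                  # exploration step
        if n_options > 0 and rng.random() < init_p:
            return n_primitive + rng.randrange(n_options)   # random option
        return rng.randrange(n_primitive)                   # random primitive action
    best = max(q_row)
    # Greedy step; breaking ties uniformly at random is an assumption.
    return rng.choice([i for i, q in enumerate(q_row) if q == best])


def q_update(q, s, a, reward, s_next, alpha=ALPHA, gamma=GAMMA):
    """One-step Q-learning update for a primitive action."""
    target = reward + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def goal_and_start(adjacency, run_number):
    """Seed with the run number, pick a random goal, start at a state farthest from it."""
    rng = random.Random(run_number)
    goal = rng.choice(sorted(adjacency))
    # Breadth-first distances from the goal; assumes an undirected {node: [neighbors]} graph.
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    start = max(dist, key=dist.get)             # first farthest state found (tie rule is an assumption)
    return goal, start
```

Seeding `goal_and_start` with the run number mirrors the quoted statement that every compared method sees the same start and goal states for a given agent instance. For the directed environments, distances toward the goal would need a search over reversed edges, which this sketch does not cover.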