Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A Model of Place Field Reorganization During Reward Maximization
Authors: M Ganesh Kumar, Blake Bordelon, Jacob A Zavatone-Veth, Cengiz Pehlevan
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop a normative framework using a reward maximization objective, whereby the temporal difference (TD) error drives place field reorganization to improve policy learning. Place fields are modeled using Gaussian radial basis functions to represent states in an environment, and directly synapse to an actor-critic for policy learning. Each field's amplitude, center, and width, as well as downstream weights, are updated online at each time step to maximize rewards. We demonstrate that this framework unifies three disparate phenomena observed in navigation experiments. Furthermore, we show that these place field phenomena improve policy convergence when learning to navigate to a single target and relearning multiple new targets. |
| Researcher Affiliation | Academia | ¹The John A. Paulson School of Engineering and Applied Sciences, Harvard University; ²Center for Brain Science, Harvard University; ³The Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; ⁴Society of Fellows, Harvard University. Correspondence to: M Ganesh Kumar <EMAIL>. |
| Pseudocode | No | The paper describes update rules and equations (e.g., Equations 47–52) in the appendix under 'A. Details of the Place field-based navigation model' and 'A.5. Online update of place field and actor-critic parameters'. However, these are presented as mathematical expressions or descriptive steps, not as structured pseudocode blocks with explicit 'Algorithm' or 'Pseudocode' labels or a code-like format. |
| Open Source Code | Yes | Code Availability The code for our agents and to reproduce all figures in this paper is available at: https://github.com/Pehlevan-Group/placefield_reorg_agent |
| Open Datasets | No | The paper describes a simulated environment (1D track or 2D arena) where agents learn to navigate. It does not use or provide access to external, pre-existing publicly available datasets. The data for experiments are generated within this simulated environment as the model learns. |
| Dataset Splits | No | The paper describes the simulated environment and experimental trials, such as 'The trial is terminated when either the maximum trial time $T_{max}$ is reached or when the total reward achieved $\sum_{t=0}^{T} r_t$ reaches a threshold $R_{max}$.' and 'in a finite-horizon setting, modeling the trial structure in neuroscience experiments... after a maximum of 100 steps in the 1D track and 300 steps in the 2D arena.' This describes the simulation's termination conditions and structure for individual runs (trials) but does not involve splitting an external dataset into training, testing, or validation sets. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory used for running the experiments. It focuses on the model and simulation results. |
| Software Dependencies | No | The paper mentions 'most optimizers e.g. in TensorFlow, PyTorch' in Appendix A.4, indicating the potential use of these frameworks. However, it does not specify any version numbers for these or other key software components, which is required for a reproducible description. |
| Experiment Setup | Yes | To model an animal's reward maximization performance during navigational learning we compute the cumulative discounted reward $G = \sum_{t=0}^{T}\sum_{k=0}^{T} \gamma^k r_{t+1+k}$ for the entire trajectory for each trial using $\gamma = 0.9$ as the discount factor... $g_{t+1} = (1-\alpha_{env})\,g_t + \alpha_{env}\,G_t/v_{max}$ (Eq. 1) ... using a constant $\alpha_{env} = 0.2$ after scaling for maximum displacement using $v_{max} = 0.1$... The critic linearly weights place field activity using a vector $w^v_i$ to estimate the value of the current location: $v(x_t) = \sum_{i}^{N} w^v_i \phi_i(x_t)$... $w^v_i$ and $W^\pi_{ji}$ were initialized by sampling from a normal distribution $\mathcal{N}(0, 10^{-5})$... The learning rates for the actor-critic and place field parameters can be the same (Fig. S13)... In simulations, we use $\eta = 0.01$ and $\eta_\theta = 0.0001$. |
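To make the quoted setup concrete, the following is a minimal illustrative sketch (not the authors' released code, which is at the linked repository) of the pieces named above: Gaussian radial basis place fields, a linear critic $v(x_t) = \sum_i w^v_i \phi_i(x_t)$, small-variance weight initialization, and a single online TD update using the paper's reported $\gamma = 0.9$ and $\eta = 0.01$. The field parameters and the 1D positions are hypothetical choices for the sketch; the paper additionally updates field amplitudes, centers, and widths, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1D track with N Gaussian place fields phi_i(x).
N = 16
centers = np.linspace(0.0, 1.0, N)  # field centers c_i (assumed layout)
widths = np.full(N, 0.1)            # field widths s_i (assumed value)
amps = np.ones(N)                   # field amplitudes a_i (assumed value)

# Critic weights drawn from N(0, 1e-5), as quoted in the setup
# (treating 1e-5 as the standard deviation; the paper may mean variance).
w_v = rng.normal(0.0, 1e-5, N)

gamma = 0.9  # discount factor from the paper
eta = 0.01   # actor-critic learning rate from the paper

def phi(x):
    """Place-field population activity at position x."""
    return amps * np.exp(-(x - centers) ** 2 / (2 * widths ** 2))

def value(x):
    """Linear critic: v(x) = sum_i w_i^v * phi_i(x)."""
    return float(w_v @ phi(x))

def td_update(x_t, r, x_next, terminal=False):
    """One online TD(0) step; the TD error also drives field updates in the paper."""
    global w_v
    target = r + (0.0 if terminal else gamma * value(x_next))
    delta = target - value(x_t)        # TD error
    w_v = w_v + eta * delta * phi(x_t) # credit fields active at x_t
    return delta
```

With near-zero initial weights, a rewarded transition produces a positive TD error and raises the estimated value of the visited location, which is the signal the paper uses to drive place field reorganization.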