Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
GradMetaNet: An Equivariant Architecture for Learning on Gradients
Authors: Yoav Gelberg, Yam Eitan, Aviv Navon, Aviv Shamsian, Theo (Moe) Putterman, Michael Bronstein, Haggai Maron
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate GradMetaNet on several gradient learning tasks, comparing to equivariant weight-space architectures and other natural baselines. First, we demonstrate GradMetaNet's ability to predict local curvature information using a small sample of gradients, achieving a 26.3% improvement over standard approximations, and outperforming other learned approaches. We then integrate GradMetaNet into learned optimizer architectures and apply it to train image classifiers and transformer language models, achieving up to a 4.63× reduction in steps compared to Adam, and a 1.78× improvement over other learned baselines. Finally, we use GradMetaNet for model editing, where we improve on current state-of-the-art results in editing MNIST and CIFAR10 INRs by up to 22.5%. Across all tasks, GradMetaNet consistently outperforms baselines, highlighting the value of efficient gradient representations and equivariant processing of sets of gradients. |
| Researcher Affiliation | Collaboration | Yoav Gelberg (University of Oxford, EMAIL); Yam Eitan (Technion, EMAIL); Aviv Navon (Independent Researcher); Aviv Shamsian (Bar-Ilan University); Theo (Moe) Putterman (UC Berkeley); Michael Bronstein (University of Oxford, AITHYRA); Haggai Maron (Technion/NVIDIA) |
| Pseudocode | Yes | A.2 Extracting Activations and Pre-Activation Gradient Signals. As mentioned in Section 3, the activations (a(l)) and pre-activation gradient signals (g(l)) used for the gradient decomposition are naturally computed during backpropagation and don't need to be recomputed. The following is a PyTorch code example for extracting these components without additional cost using forward/backward hooks: import torch import torch.nn as nn import torch.nn.functional as F class MLP(nn.Module): def __init__(self): super(MLP, self).__init__() self.fc1 = nn.Linear(8, 32) self.fc2 = nn.Linear(32, 16) self.fc3 = nn.Linear(16, 3) def forward(self, x): x = F.relu(self.fc1(x)) x = F.relu(self.fc2(x)) x = self.fc3(x) return x activations = {} tangents = {} def forward_hook(module, inp, out): activations[module] = inp[0].detach() def backward_hook(module, grad_inp, grad_out): tangents[module] = grad_out[0].detach() model = MLP() # Set hooks model.fc1.register_forward_hook(forward_hook) model.fc1.register_full_backward_hook(backward_hook) model.fc2.register_forward_hook(forward_hook) model.fc2.register_full_backward_hook(backward_hook) model.fc3.register_forward_hook(forward_hook) model.fc3.register_full_backward_hook(backward_hook) # Backpropagate loss x = torch.randn(4, 8) # (batch, input) target = torch.randn(4, 3) # (batch, input) output = model(x) loss = F.mse_loss(output, target) loss.backward() print(activations) print(tangents) |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We are currently in the process of cleaning and unifying the codebase, which includes different experiments implemented in different frameworks, using both JAX and PyTorch (to comply with baseline implementations), making the task more involved. We are committed to releasing a well-documented version of the code as soon as possible. |
| Open Datasets | Yes | Tasks. We use three types of optimization tasks: (1) optimizing a 2-parameter linear regression, constructed to have non-diagonal curvature, (2) optimizing MLPs for classifying CIFAR10, CIFAR100 [44], and FashionMNIST [90] images, and (3) optimizing transformer-based language models on LM1B [14]. Data. Following previous works [38, 93, 94], we use two standard benchmarks: figure dilation for MNIST INRs and contrast enhancement for CIFAR-10 INRs. |
| Dataset Splits | Yes | Data. We first create a set of randomly initialized MLPs with 1-dimensional input and output. We then generate the targets by computing the FIM diagonal for each model over a sample of 1024 inputs in [-1, 1]. The input to each baseline is a smaller gradient sample computed over 128 points sampled from [-1, 1]. Data preparation. We use 500 examples as a test dataset and 500 examples as a validation dataset, with the size of the training set varying between 10 and 2000 examples. Meta-training details. The meta-training objective is training loss at the end of the inner training horizon T, which is T = 2,000 for image classification tasks, T = 5,000 for the transformer language modeling task, and T = 10 for the 2D linear regression experiment. Tasks. ...We train with a batch size of 8 on length 8 sequences. INR Editing Dataset. ...sample 64 random input coordinates in [0, 1]² |
| Hardware Specification | Yes | In this section we provide all experimental details for all experiments in Section 7. We run all the experiments on a single NVIDIA-A100-SXM4 GPU with 40GB of memory. Table 12: Update-time and memory comparison on a GPT-2 scale transformer using 2 A100-40GB. |
| Software Dependencies | No | PyTorch [66] is mentioned and a code example is given in Appendix A.2. OpenCV [10] is also mentioned. However, specific version numbers for these software dependencies are not provided in the paper. |
| Experiment Setup | Yes | All models were trained for 100 epochs using the Adam optimizer with a learning rate of 1 × 10⁻³ and a batch size of 32. We meta-train for 50,000 steps using Adam [40] with learning rate 10⁻⁴, estimating meta-gradients over 16 parallel training runs using persistent evolutionary strategies (PES) [85] with a truncation length of 50. The meta-training objective is training loss at the end of the inner training horizon T, which is T = 2,000 for image classification tasks, T = 5,000 for the transformer language modeling task, and T = 10 for the 2D linear regression experiment. For all methods, we initialize α = 0.1, µ = 0.9 and β = 0.001 before meta-training. We train all methods for 150K steps using the AdamW [50] optimizer with a batch size of 64. We search over learning rates in {0.01, 0.005, 0.0001}. |
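
The hook-based extraction quoted in the Pseudocode row is flattened onto a single line by the table layout. A minimal runnable sketch of the same idea follows, assuming a recent PyTorch install. The final check is an addition that does not appear in the quoted appendix: it uses the standard identity that, for a linear layer, the weight gradient equals the batch sum of outer products of the pre-activation gradient signals and the input activations. The names `layers` and `decomposition_ok` are illustrative, and the seed is fixed only to make the run repeatable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, 3)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


activations = {}  # layer -> input activations a, shape (batch, in_features)
tangents = {}     # layer -> pre-activation gradient signals g, shape (batch, out_features)


def forward_hook(module, inp, out):
    # inp is a tuple of positional inputs; the first entry is the layer input
    activations[module] = inp[0].detach()


def backward_hook(module, grad_inp, grad_out):
    # grad_out[0] is the loss gradient w.r.t. the layer's (pre-activation) output
    tangents[module] = grad_out[0].detach()


torch.manual_seed(0)  # illustrative: fixed seed for repeatability
model = MLP()
layers = [model.fc1, model.fc2, model.fc3]
for layer in layers:
    layer.register_forward_hook(forward_hook)
    layer.register_full_backward_hook(backward_hook)

# Backpropagate a loss once; the hooks record a and g as a side effect
x = torch.randn(4, 8)       # (batch, input)
target = torch.randn(4, 3)  # (batch, output)
loss = F.mse_loss(model(x), target)
loss.backward()

# Added check (not in the quoted appendix): weight.grad == g^T @ a per layer
decomposition_ok = all(
    torch.allclose(layer.weight.grad, tangents[layer].T @ activations[layer], atol=1e-5)
    for layer in layers
)
print(decomposition_ok)
```

If the check holds, the pairs stored in `activations` and `tangents` carry the same information as the full weight gradients while storing `batch × (in_features + out_features)` values per layer rather than `out_features × in_features` values per example.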