Deep Equilibrium Algorithmic Reasoning
Authors: Dobrik Georgiev, JJ Wilson, Davide Buffelli, Pietro Liò
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical evidence, leveraging algorithms from the CLRS-30 benchmark, validates that one can train a network to solve algorithmic problems by directly finding the equilibrium. For each algorithm we generated 10^5/100/100-sized training/validation/test datasets. |
| Researcher Affiliation | Collaboration | Dobrik Georgiev, University of Cambridge, dgg30@cam.ac.uk; JJ Wilson, Independent Researcher, josephjwilson74@gmail.com; Davide Buffelli, MediaTek Research, davide.buffelli@mtkresearch.com; Pietro Liò, University of Cambridge, pl219@cam.ac.uk |
| Pseudocode | Yes | Listing 1: BFS algorithm. Clearly implemented as `while b do c` (see the BFS sketch after this table). |
| Open Source Code | Yes | Source code available here: https://github.com/HekpoMaH/DEAR |
| Open Datasets | Yes | Our empirical evidence, leveraging algorithms from the CLRS-30 benchmark, validates that one can train a network to solve algorithmic problems by directly finding the equilibrium. |
| Dataset Splits | Yes | For each algorithm we generated 10^5/100/100-sized training/validation/test datasets. Training sample sizes vary between 8 and 16 elements (uniformly randomly chosen); validation samples are of size 16. |
| Hardware Specification | Yes | If run on a single 4090 GPU, one would need about 3 weeks of total compute. |
| Software Dependencies | Yes | The final node embeddings, from which the output is decoded, are the solution to the equation $H^{(\infty)} = \mathcal{P}_{UE}(H^{(\infty)})$ (Equation 4), where $\mathcal{P}_{UE}(H^{(\infty)}) = \mathcal{P}(H^{(\infty)}, U, E)$, $U$/$E$ are the encoded node and edge feature matrices, and $\mathcal{P}$ is the processor function. $H^{(t)}$ are the stacked latent states of the nodes at timestep $t$ (with $H^{(0)} = 0$). Equation 4 matches the signature of Equation 2 and can be solved via root-finding (we employ torchdeq [19]; MIT License), as if it were the $f_\theta$ of a DEQ (see the fixed-point sketch after this table). Our differences are mostly required by software engineering rather than research, hence they live here. Differences are: different DL framework (PyTorch [36]) |
| Experiment Setup | Yes | In our experiments the models have a latent dimensionality of 128, the batch size is 32, the learning rate is $3 \times 10^{-4}$, and we use the Adam optimizer [31]. We train our algorithmic reasoners for 100 epochs, choosing the model with the lowest task validation loss (discounting any regularisation; focusing on performance only) for testing (see the training-loop sketch after this table). |
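
The Pseudocode row quotes the paper's Listing 1, which expresses BFS in the `while b do c` form. Below is a minimal Python sketch of that loop structure; the graph representation and names (`adj`, `frontier`) are illustrative assumptions, not the paper's code.

```python
from collections import deque

def bfs(adj, source):
    """Breadth-first search written as a single `while b do c` loop.

    `adj` maps each node to an iterable of neighbours (an assumed,
    illustrative representation). Returns the set of reachable nodes.
    """
    visited = {source}
    frontier = deque([source])
    # `b`: the frontier is non-empty; `c`: expand one node.
    while frontier:
        node = frontier.popleft()
        for neighbour in adj[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(neighbour)
    return visited
```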
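The Software Dependencies row centres on Equation 4, a fixed-point condition $H^{(\infty)} = \mathcal{P}_{UE}(H^{(\infty)})$ that the paper solves with torchdeq's root-finding. The sketch below only illustrates the equilibrium condition via naive fixed-point iteration; the `processor` callable, tolerance, and iteration cap are assumptions, and the paper's actual solver is torchdeq, not this loop.

```python
import torch

def solve_equilibrium(processor, U, E, num_nodes, dim,
                      max_iter=100, tol=1e-4):
    """Solve H = processor(H, U, E) by naive fixed-point iteration.

    Illustrates Equation 4 only; the paper instead hands this equation
    to torchdeq's root-finding solvers.
    """
    H = torch.zeros(num_nodes, dim)  # H^(0) = 0, as in the paper
    for _ in range(max_iter):
        H_next = processor(H, U, E)
        if torch.norm(H_next - H) < tol:  # approximate equilibrium reached
            return H_next
        H = H_next
    return H
```

In a DEQ, gradients are taken implicitly at the equilibrium (via the implicit function theorem) rather than by backpropagating through each iteration, which is what makes root-finding solvers like those in torchdeq practical here.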
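The Experiment Setup row fixes the training hyperparameters (latent dim 128, batch size 32, Adam at $3 \times 10^{-4}$, 100 epochs, selection on task validation loss). A sketch of the corresponding loop follows; the `model`, data loaders, and `task_loss` signature are hypothetical placeholders, not the paper's implementation.

```python
import copy
import torch

def train(model, train_loader, val_loader, task_loss):
    """Train for 100 epochs with Adam (lr 3e-4), keeping the checkpoint
    with the lowest task validation loss, as described in the paper."""
    optimiser = torch.optim.Adam(model.parameters(), lr=3e-4)
    best_val, best_state = float("inf"), None
    for epoch in range(100):
        model.train()
        for batch in train_loader:  # batches of 32 samples
            optimiser.zero_grad()
            loss = task_loss(model(batch), batch)
            loss.backward()
            optimiser.step()
        model.eval()
        with torch.no_grad():
            val = sum(task_loss(model(b), b).item() for b in val_loader)
        if val < best_val:  # select on task loss only, no regularisation
            best_val = val
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```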