Adaptive Checkpoint Adjoint Method for Gradient Estimation in Neural ODE
Authors: Juntang Zhuang, Nicha Dvornek, Xiaoxiao Li, Sekhar Tatikonda, Xenophon Papademetris, James Duncan
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The empirical performance of neural ordinary differential equations (NODEs) is significantly inferior to discrete-layer models on benchmark tasks (e.g. image classification). We demonstrate an explanation is the inaccuracy of existing gradient estimation methods: the adjoint method has numerical errors in reverse-mode integration; the naive method suffers from a redundantly deep computation graph. We propose the Adaptive Checkpoint Adjoint (ACA) method: ACA applies a trajectory checkpoint strategy which records the forward-mode trajectory as the reverse-mode trajectory to guarantee accuracy; ACA deletes redundant components for shallow computation graphs; and ACA supports adaptive solvers. On image classification tasks, compared with the adjoint and naive method, ACA achieves half the error rate in half the training time; NODE trained with ACA outperforms ResNet in both accuracy and test-retest reliability. On time-series modeling, ACA outperforms competing methods. (A minimal sketch of the checkpointing idea appears below the table.) |
| Researcher Affiliation | Academia | 1Department of Biomedical Engineering, Yale University, New Haven, CT USA 2Department of Radiology & Biomedical Imaging, Yale School of Medicine, New Haven, CT USA 3Department of Statistics and Data Science, Yale University, New Haven, CT USA 4Department of Electrical Engineering, Yale University, New Haven, CT USA. |
| Pseudocode | Yes | Algorithm 1 Numerical Integration; Algorithm 2 ACA: Record z(t) with Minimal Memory |
| Open Source Code | Yes | We provide the PyTorch implementation of ACA: https://github.com/juntang-zhuang/torch-ACA. |
| Open Datasets | Yes | We trained the same NODE structure to perform image classification on the CIFAR10 dataset using different gradient estimation methods. ...On both CIFAR10 and CIFAR100 datasets... We validate our method on the Mujoco dataset (Tassa et al., 2018). |
| Dataset Splits | No | The paper mentions using CIFAR10 and CIFAR100 datasets and evaluating on a 'test set' but does not specify the exact training, validation, and test split percentages or sample counts for reproduction. |
| Hardware Specification | Yes | To train for 90 epochs on a single GTX 1080Ti GPU, ACA takes about 9 hours, while the adjoint method takes about 18 hours, and the naive method takes more than 30 hours. |
| Software Dependencies | No | The paper mentions 'PyTorch implementation of ACA' but does not specify exact version numbers for PyTorch or any other software libraries. |
| Experiment Setup | Yes | The relative and absolute error tolerance are set as 1e-5 for the adjoint and naive method, with Dopri5 solver implemented by Chen et al. (2018). All methods are trained with SGD optimizer. For each method, we perform 3 runs and record the mean and variance of test accuracy varying with training process. All models are trained for 90 epochs, with initial learning rate of 0.01, and decayed by a factor of 0.1 at epoch 30 and 60. The adjoint method and ACA use a batch size of 128, while the naive method uses a batch size of 32 due to its large memory cost. (See the training-schedule sketch below the table.) |
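
The checkpointing idea quoted in the abstract row can be illustrated with a short sketch. The code below is not the authors' implementation (see the torch-ACA repository for that); it is a minimal sketch assuming a fixed-step Euler solver instead of the adaptive Dopri5 solver the paper uses, and the names `ACAFunction` and `f` are hypothetical stand-ins. The forward pass records the trajectory without building a graph; the backward pass replays one step at a time from the stored checkpoints, so the reverse-mode trajectory matches the forward one exactly while the autograd graph stays one step deep.

```python
# Minimal sketch of ACA-style trajectory checkpointing (NOT the authors' code).
# Assumptions: fixed-step Euler in place of the paper's adaptive Dopri5 solver;
# `f` is a small stand-in for the NODE dynamics dz/dt = f(z).
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 2))

class ACAFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, z0, ts, *params):
        # Forward: integrate with no graph, but checkpoint the state at every
        # accepted step so backward can reuse the exact forward trajectory.
        with torch.no_grad():
            traj = [z0]
            for t0, t1 in zip(ts[:-1], ts[1:]):
                traj.append(traj[-1] + (t1 - t0) * f(traj[-1]))  # one Euler step
        ctx.save_for_backward(ts, *traj)
        return traj[-1]

    @staticmethod
    def backward(ctx, grad_z):
        ts, *traj = ctx.saved_tensors
        params = tuple(f.parameters())
        grads_p = [torch.zeros_like(p) for p in params]
        # Backward: replay each step from its checkpoint with autograd enabled,
        # so the local graph is only ever one step deep ("shallow graph").
        for i in reversed(range(len(traj) - 1)):
            z = traj[i].detach().requires_grad_(True)
            with torch.enable_grad():
                z_next = z + (ts[i + 1] - ts[i]) * f(z)
            grads = torch.autograd.grad(z_next, (z, *params), grad_z)
            grad_z = grads[0]                     # chain rule through this step
            for acc, g in zip(grads_p, grads[1:]):
                acc += g                          # accumulate parameter grads
        return (grad_z, None, *grads_p)

# Usage: gradients flow to f's parameters without a deep forward graph.
z0 = torch.randn(8, 2)
ts = torch.linspace(0.0, 1.0, 11)
zT = ACAFunction.apply(z0, ts, *f.parameters())
zT.sum().backward()
```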
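
The hyperparameters quoted in the Experiment Setup row map directly onto a standard PyTorch training loop. The sketch below is an assumption-laden reconstruction, not the authors' script: `model` is a trivial placeholder for the NODE classifier, and the SGD momentum value is omitted because the quote does not state it.

```python
# Minimal sketch of the reported CIFAR10 schedule (not the authors' script):
# SGD, 90 epochs, initial lr 0.01 decayed by 0.1 at epochs 30 and 60,
# batch size 128 (32 for the naive method). Momentum is not stated in the quote.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 32 * 32, 10))  # stand-in for the NODE

train_set = datasets.CIFAR10("./data", train=True, download=True,
                             transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[30, 60], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(90):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # lr: 0.01 -> 0.001 at epoch 30 -> 0.0001 at epoch 60
```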