Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Vision-LSTM: xLSTM as Generic Vision Backbone
Authors: Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, Johannes Brandstetter
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pre-train models on ImageNet-1K (Deng et al., 2009), which contains 1.3M training images and 50K validation images where each image belongs to one of 1000 classes. ViL models are trained for 800 epochs (tiny) or 400 epochs (small, base) on 192x192 resolution with a learning rate of 1e-3 using a cosine decay schedule. Afterwards, the model is fine-tuned on 224x224 resolution for 20 epochs using a learning rate of 1e-5. Detailed hyperparameters can be found in Appendix Table 10. We then transfer the pre-trained models to several benchmark tasks: ImageNet-1K classification on the validation set, ADE20K (Zhou et al., 2019) semantic segmentation and VTAB-1K (Zhai et al., 2019) classification. These benchmarks evaluate global image understanding (ImageNet-1K), semantic local and global understanding (ADE20K) and few-shot generalization to a diverse set of 19 VTAB-1K classification datasets, which include natural images, specialized imagery (medical and satellite) and structured tasks (camera angle prediction, depth estimation, object counting, ...). We ablate various design choices of ViL by training ViL-T models for 100 epochs on ImageNet-1K in 224x224 resolution; other hyperparameters follow the ones from Section 3 (see also Appendix B.3). |
| Researcher Affiliation | Collaboration | Benedikt Alkin1,2 Maximilian Beck1,3 Korbinian Pöppel1,3 Sepp Hochreiter1,2,3 Johannes Brandstetter1,2 1ELLIS Unit Linz, Institute for Machine Learning, JKU Linz, Austria 2Emmi AI GmbH, Linz, Austria 3NXAI GmbH, Linz, Austria EMAIL |
| Pseudocode | No | The paper describes the mLSTM forward pass using mathematical equations (1)-(12) in Section 2.1, but does not present it as a structured pseudocode or algorithm block. |
| Open Source Code | No | Project page: https://nx-ai.github.io/vision-lstm/ |
| Open Datasets | Yes | We pre-train models on ImageNet-1K (Deng et al., 2009), which contains 1.3M training images and 50K validation images where each image belongs to one of 1000 classes. We then transfer the pre-trained models to several benchmark tasks: ImageNet-1K classification on the validation set, ADE20K (Zhou et al., 2019) semantic segmentation and VTAB-1K (Zhai et al., 2019) classification. |
| Dataset Splits | Yes | We pre-train models on ImageNet-1K (Deng et al., 2009), which contains 1.3M training images and 50K validation images where each image belongs to one of 1000 classes. For fine-tuning models on VTAB-1K we provide the hyperparameters in Table 11. We search for the best learning rate for each dataset by fine-tuning the model 25 times (5 learning rates with 5 seeds each) on the 800 training samples and evaluating them on the 200 validation samples. With the best learning rate, we then train each model 5 times on the concatenation of the training and validation splits, evaluate on the test split and report the average accuracy. |
| Hardware Specification | Yes | We train models on servers with either 8x A100 or 4x A100 nodes. Runtimes denote the training time for 10 ImageNet-1K epochs and are extrapolated from short benchmark runs on a single A100-80GB-PCIe using float16 precision and 224x224 images. |
| Software Dependencies | No | This limitation comes from the current lack of optimized hardware implementations of the mLSTM (e.g., CUDA kernels) where we instead rely on torch.compile, a generic speed optimization method from PyTorch (Paszke et al., 2019), to optimize computations. However, implementing fast compute kernels in CUDA (NVIDIA et al., 2020) or Triton (Tillet et al., 2019) is highly non-trivial as it requires expert hardware architecture knowledge, advanced implementation skills and potentially multiple development cycles to iron out numerical inaccuracies or instabilities. |
| Experiment Setup | Yes | ViL models are trained for 800 epochs (tiny) or 400 epochs (small, base) on 192x192 resolution with a learning rate of 1e-3 using a cosine decay schedule. Afterwards, the model is fine-tuned on 224x224 resolution for 20 epochs using a learning rate of 1e-5. Detailed hyperparameters can be found in Appendix Table 10. Table 10 shows detailed hyperparameters used to train ViL models. Table 11: Hyperparameters for fine-tuning on VTAB-1K. Table 12: Hyperparameters for fine-tuning on ADE20K. |
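The pre-training schedule quoted in the Experiment Setup row (base learning rate 1e-3 with cosine decay over 800 epochs for the tiny model) can be sketched as a minimal per-epoch schedule. The 800-epoch length and 1e-3 base rate come from the paper; the zero learning-rate floor and per-epoch granularity are assumptions:

```python
import math

def cosine_lr(epoch: int, total_epochs: int,
              base_lr: float = 1e-3, min_lr: float = 0.0) -> float:
    """Cosine-decayed learning rate for a given epoch.

    base_lr=1e-3 and total_epochs=800 (ViL-T) follow the paper's
    reported values; min_lr=0.0 is an assumption for illustration.
    """
    progress = epoch / total_epochs  # fraction of training completed
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Schedule starts at the base rate and decays to the floor.
start = cosine_lr(0, 800)    # 1e-3
middle = cosine_lr(400, 800) # 5e-4
end = cosine_lr(800, 800)    # ~0.0
```

Real training code would additionally include a warmup phase, which the quoted excerpt does not specify.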
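The VTAB-1K protocol described in the Dataset Splits row (25 runs for learning-rate selection, then 5 final runs on the merged train+val split) can be sketched as a simple two-step loop. The `train_and_eval` function and the candidate learning rates are hypothetical placeholders, not values from the paper:

```python
import random
import statistics

def train_and_eval(lr: float, seed: int, use_val_for_training: bool = False) -> float:
    """Hypothetical stand-in for fine-tuning and evaluating a model.

    The real protocol fine-tunes on VTAB-1K's 800-sample train split and
    evaluates on the 200-sample val split (or, in the final runs, trains
    on train+val and evaluates on test). Here we return a deterministic
    placeholder accuracy so the control flow is runnable.
    """
    rng = random.Random(hash((lr, seed, use_val_for_training)))
    return rng.uniform(0.5, 0.9)  # placeholder accuracy

learning_rates = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]  # 5 candidates (assumed values)
seeds = range(5)

# Step 1: 5 learning rates x 5 seeds = 25 runs; pick the lr with the
# best mean validation accuracy.
best_lr = max(
    learning_rates,
    key=lambda lr: statistics.mean(train_and_eval(lr, s) for s in seeds),
)

# Step 2: retrain 5 times on the concatenated train+val split with the
# best lr and report the average test accuracy.
final_acc = statistics.mean(
    train_and_eval(best_lr, s, use_val_for_training=True) for s in seeds
)
```

Selecting the learning rate on the small validation split and then folding that split back into training is a common low-data protocol; the sketch only mirrors the run counts the excerpt reports.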