A fast heuristic to optimize time-space tradeoff for large models
Authors: Akifumi Imanishi, Zijian Xu, Masayuki Takagi, Sixue Wang, Emilio Castillo
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We applied Fast SA to PyTorch models and verified its effectiveness on popular large vision and text models, including recent language models with the transformer architecture. The results demonstrate significant memory reductions of 73% on average with an extra 18% computational overhead. Our experiments demonstrate the practicality and effectiveness of our recomputation algorithm, further highlighting its potential for wide application in various deep learning domains. |
| Researcher Affiliation | Industry | Akifumi Imanishi (Preferred Networks, imanishi@preferred.jp); Zijian Xu (Preferred Networks, joe@preferred.jp); Masayuki Takagi (Preferred Networks, mtakagi@preferred.jp); Sixue Wang (Preferred Networks, cecilwang@preferred.jp); Emilio Castillo (Preferred Networks, ecastill@preferred.jp) |
| Pseudocode | Yes | Algorithm 1 Node grouping |
| Open Source Code | No | The paper states that 'Our algorithm was integrated into the PyTorch framework' and describes its implementation, but does not provide an explicit statement about releasing the source code or a link to a repository for the Fast SA algorithm. |
| Open Datasets | Yes | Our experiments involved the latest vision models and vision transformers obtained from timm (PyTorch Image Models), as well as text models (including language models) from Hugging Face transformers. Specific models from these sources are listed with citations, such as 'LLaMA [40]'. |
| Dataset Splits | No | The paper does not specify training, validation, or test dataset splits for its own experiments. It focuses on optimizing the computational graph of existing models, which implies that any dataset splitting would have been handled during the original training of those pre-trained models. |
| Hardware Specification | Yes | The proposed algorithm (Fast SA) and Checkmate LP were evaluated on a cluster system, with each instance configured with 8 CPU cores (Intel(R) Xeon(R) Platinum 8380 @ 2.30GHz), 100 GiB of RAM, and an NVIDIA A100 80GB GPU. Although the PDLP solver fully uses the allocated resources, Fast SA requires only a single CPU. Due to Gurobi license constraints, Checkmate MILP was executed on another machine with 36 CPU cores (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00 GHz) and 376 GiB of RAM. |
| Software Dependencies | Yes | For our experiments, we used PyTorch 2.1.0.dev20230404... Gurobi [15] is used as the internal solver for Checkmate MILP, together with the PDLP solver provided by OR-Tools [28] for Checkmate LP... The versions of timm and transformers are 0.9.1 and 4.28.1, respectively... compiled using GCC 10.4.0 with the -O3 option. The compiled module was integrated with PyTorch 2.1.0.dev20230404, and the experiments were run using Python 3.9.12 with CUDA 11.8. (A hedged environment-check sketch based on these versions follows the table.) |
| Experiment Setup | Yes | For memory budgets, we used the 50% and 25% values of the simulated initial peak memory... The first SA ran on grouped nodes for at most 20 million iterations... The second SA ran for a fixed 2 million iterations... The initial temperature is set as 0.1% of the initial objective value, and the final temperature is set as 0.01% of the initial temperature... For vision models shown in Table 3, we set the batch size to 512... For text models shown in Table 4, we used batch size 128 and context length 512... For language models shown in Table 5, we used batch size 8 and context length 2048... (An illustrative annealing-schedule sketch based on these settings follows the table.) |
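The Software Dependencies row pins specific library versions. The following is a minimal, hypothetical Python snippet that checks a local environment against those reported versions; it is an illustration only and not part of the authors' artifact.

```python
# Hypothetical check of a local environment against the versions reported in the paper.
# The expected strings come directly from the Software Dependencies row above.
import torch
import timm
import transformers

expected = {
    "torch": "2.1.0.dev20230404",   # PyTorch nightly build used in the paper
    "timm": "0.9.1",
    "transformers": "4.28.1",
}
for name, module in [("torch", torch), ("timm", timm), ("transformers", transformers)]:
    actual = module.__version__
    status = "OK" if actual == expected[name] else f"mismatch (found {actual})"
    print(f"{name} {expected[name]}: {status}")

# The paper reports CUDA 11.8; torch.version.cuda exposes the toolkit PyTorch was built with.
print(f"CUDA (expected 11.8): {torch.version.cuda}")
```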
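The Experiment Setup row describes a two-phase simulated annealing run (at most 20 million iterations on grouped nodes, then a fixed 2 million iterations) with the initial temperature set to 0.1% of the initial objective value and the final temperature to 0.01% of that initial temperature. The sketch below is a generic simulated-annealing loop with a geometric cooling schedule matching those numbers; the plan representation, the `propose` move generator, and the `objective` function are placeholders, and this is not the authors' Fast SA implementation.

```python
import math
import random

def cooling_schedule(initial_objective, num_iters):
    """Temperatures from 0.1% of the initial objective down to 0.01% of that
    starting temperature, as reported in the setup. The geometric interpolation
    between the two endpoints is an assumption."""
    t0 = 1e-3 * initial_objective          # initial temperature
    t_final = 1e-4 * t0                    # final temperature
    alpha = (t_final / t0) ** (1.0 / max(num_iters - 1, 1))
    return (t0 * alpha ** i for i in range(num_iters))

def simulated_annealing(initial_plan, propose, objective, num_iters):
    """Generic SA loop: always accept improving moves, accept worsening moves
    with probability exp(-delta / T)."""
    plan = initial_plan
    cost = objective(plan)
    for temp in cooling_schedule(max(cost, 1e-12), num_iters):
        candidate = propose(plan)
        delta = objective(candidate) - cost
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            plan, cost = candidate, cost + delta
    return plan
```

Under the reported settings, the first phase would call such a loop with up to 20,000,000 iterations on the grouped graph and the second phase with 2,000,000 iterations on the refined plan.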