Rethinking skip connection model as a learnable Markov chain
Authors: Dengsheng Chen, Jie Hu, Wenwen Qiang, Xiaoming Wei, Enhua Wu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The encouraging experimental results in multi-modal translation and image recognition empirically confirm our conjecture of the learnable Markov chain view and demonstrate the superiority of the proposed penal connection. |
| Researcher Affiliation | Collaboration | Dengsheng Chen (Meituan, chendengsheng@meituan.com); Jie Hu (State Key Laboratory of Computer Science, ISCAS; University of Chinese Academy of Sciences, hujie@ios.ac.cn); Wenwen Qiang (University of Chinese Academy of Sciences; Institute of Software, Chinese Academy of Sciences, wenwen2018@iscas.ac.cn); Xiaoming Wei (Meituan, weixiaoming@meituan.com); Enhua Wu (State Key Laboratory of Computer Science, ISCAS; University of Chinese Academy of Sciences; University of Macau, ehwu@um.edu.mo) |
| Pseudocode | Yes | Algorithm 1: Pseudo-code of the penal connection in a PyTorch-like style. `z_l = f_θ_l(x_{l-1})`; `z_l.register_hook(lambda g_xl, z_l=z_l.detach().clone(): g_xl + τ * z_l)  # only this line is added, to register a hook`; `x_l = x_{l-1} + z_l` |
| Open Source Code | No | The paper describes how the method can be implemented with one line of code and provides pseudocode (Algorithm 1), but there is no explicit statement or link indicating that the source code for the methodology is openly available or has been released. |
| Open Datasets | Yes | Dataset. WMT16 (Bojar et al., 2016) is a widely used translation dataset based on the data from statmt.org, which contains various interesting translation tasks on specified domains. Here, we are focusing on the news translation tasks between English and German. The text for all the test sets is drawn from news articles. (...) Dataset. We conduct a series of experiments on the task of image classification using the ImageNet-1K dataset (Deng et al., 2009), which consists of 1.28 million training images across 1000 classes and 50k images for validation. (...) Dataset. The CIFAR-10 dataset consists of 60k 32×32 color images in 10 classes. There are 50k training images and 10k test images. The CIFAR-100 dataset is identical to CIFAR-10, except the number of classes is 100. |
| Dataset Splits | Yes | Dataset. We conduct a series of experiments on the task of image classification using the ImageNet-1K dataset (Deng et al., 2009), which consists of 1.28 million training images across 1000 classes and 50k images for validation. |
| Hardware Specification | Yes | All experiments are carried out by publicly available projects implemented by PyTorch (Paszke et al., 2017) on a device equipped with 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using 'PyTorch (Paszke et al., 2017)' but does not specify the version number of PyTorch or any other software dependencies. |
| Experiment Setup | Yes | Implementation details. We adopt the most widely used benchmark Transformer (Vaswani et al., 2017) as our strong baseline. The embedding size d_emb is 512, and the source and target embedding layers are shared. We empirically set τ to 3×10⁻⁴, which generalizes well to all translation tasks. Here, we opt for the mutual translation tasks between English and German for validation. All the models are trained using the Adam optimizer with β1 = 0.9, β2 = 0.98. We use a batch size of 1024 and a weight decay of 0.05; the other training recipes are identical to the original implementation (Vaswani et al., 2017). We set up two regular training settings, Q and S, separately (see Table 1). |
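The one-line hook quoted in the Pseudocode row can be expanded into a runnable sketch. This is a minimal illustration assuming PyTorch is installed; the helper name `penal_block` and the default `tau=3e-4` (the value the paper reports for translation) are illustrative, not part of the paper's released code.

```python
import torch

def penal_block(x, f, tau=3e-4):
    """One residual block with the paper's penal connection.

    `f` plays the role of the residual branch f_θ_l; `tau` is the
    penalty coefficient. The forward pass is an ordinary skip
    connection; the hook adds tau * z_l to the gradient reaching
    z_l during backprop, matching Algorithm 1.
    """
    z = f(x)
    # The hook below is the only addition over a plain skip connection.
    z.register_hook(lambda g, z=z.detach().clone(): g + tau * z)
    return x + z

# Usage: wrap any residual branch, e.g. a linear layer.
layer = torch.nn.Linear(4, 4)
x = torch.randn(2, 4, requires_grad=True)
out = penal_block(x, layer)
out.sum().backward()  # gradients now include the tau * z penalty term
```

Because the hook only touches the backward pass, inference behavior is identical to a vanilla skip connection, which is consistent with the paper's claim that the method costs one line of code.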