DataMUX: Data Multiplexing for Neural Networks
Authors: Vishvak Murahari, Carlos Jimenez, Runzhe Yang, Karthik Narasimhan
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we introduce data multiplexing (DataMUX), a technique that enables deep neural networks to process multiple inputs simultaneously using a single compact representation. DataMUX demonstrates that neural networks are capable of generating accurate predictions over mixtures of inputs, resulting in increased inference throughput with minimal extra memory requirements... We show the viability of DataMUX for different architectures (Transformers, and to a much lesser extent MLPs and CNNs) across six different tasks spanning sentence classification, named entity recognition and image classification. (A minimal sketch of the multiplexing idea follows the table.) |
| Researcher Affiliation | Academia | Vishvak Murahari, Carlos E. Jimenez, Runzhe Yang, and Karthik Narasimhan; Department of Computer Science, Princeton University (murahari@princeton.edu, carlosej@princeton.edu, runzhey@princeton.edu, karthikn@princeton.edu) |
| Pseudocode | No | The paper describes the methods using text and mathematical equations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/princeton-nlp/DataMUX |
| Open Datasets | Yes | We evaluate our models and the baselines on two types of text classification tasks: 1. Token-level classification: ... CoNLL-2003 Named Entity Recognition (NER) task (Sang and Meulder, 2003). 2. Sentence-level classification: ... GLUE benchmark (Wang et al., 2019): ... SST-2 (Socher et al., 2013), ... QQP, and the natural language inference tasks MNLI (Williams et al., 2018) and QNLI (Wang et al., 2019; Rajpurkar et al., 2016). The T-MUX models are all pre-trained using the retrieval warm-up on the WikiText-103 dataset (Merity et al., 2017). |
| Dataset Splits | Yes | For all tasks, we use the standard train/validation/test splits provided with each dataset. |
| Hardware Specification | Yes | We conducted all experiments on NVIDIA V100 GPUs (32GB) on a shared cluster. |
| Software Dependencies | No | The paper mentions using the Hugging Face framework (Wolf et al., 2019) but does not specify exact version numbers for any software dependencies. |
| Experiment Setup | Yes | In addition, we also continue to use the retrieval task as an auxiliary objective during task training. The total loss is a combination of the task loss and retrieval loss (we use α = 0.1 in our experiments): L = (1 − α)·L_Task + α·L_Retrieval (Eq. 4). For the Transformer models, we train for 10 epochs using a learning rate of 1e-4 with a linear decay schedule and a warm-up of 10% of the training steps using the AdamW optimizer with a batch size of 32. For the MLP and CNN models, we used the Adam optimizer with a learning rate of 1e-3 and a batch size of 128. (A training-loop sketch of this setup follows the table.) |
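
The multiplexing idea quoted in the Research Type row can be pictured with a short PyTorch sketch. This is a minimal illustration under stated assumptions: the module names (`Multiplexer`, `Demultiplexer`) and the average-based combiner are chosen here for readability and are not the authors' T-MUX implementation (the linked repository has the real code).

```python
# Minimal sketch of data multiplexing / demultiplexing (illustrative only).
# Assumption: inputs are combined by per-instance frozen random linear
# transforms followed by averaging; the paper's T-MUX uses its own
# multiplexing and index-based demultiplexing modules.
import torch
import torch.nn as nn


class Multiplexer(nn.Module):
    """Combine N instance representations into one shared representation."""

    def __init__(self, hidden_dim: int, num_instances: int):
        super().__init__()
        # One frozen random transform per instance so the backbone can
        # still tell apart which features came from which instance.
        self.transforms = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim, bias=False) for _ in range(num_instances)]
        )
        for t in self.transforms:
            t.weight.requires_grad_(False)

    def forward(self, xs: torch.Tensor) -> torch.Tensor:
        # xs: (num_instances, batch, seq_len, hidden_dim)
        mixed = torch.stack([t(x) for t, x in zip(self.transforms, xs)], dim=0)
        return mixed.mean(dim=0)  # (batch, seq_len, hidden_dim)


class Demultiplexer(nn.Module):
    """Recover N instance-specific representations from the shared output."""

    def __init__(self, hidden_dim: int, num_instances: int):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_instances)]
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_dim)
        # returns: (num_instances, batch, seq_len, hidden_dim)
        return torch.stack([head(h) for head in self.heads], dim=0)
```

A shared backbone (for example a Transformer encoder) would run once on the multiplexed tensor between these two modules; running the backbone once for N inputs is where the reported throughput gain comes from.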
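
Similarly, the Experiment Setup row can be summarized as a training-loop sketch. The loss weighting (α = 0.1), AdamW, the 1e-4 learning rate, and the linear schedule with a 10% warm-up come from the quoted text; the `model`, `task_loss_fn`, and `retrieval_loss_fn` interfaces, and the use of Hugging Face's `get_linear_schedule_with_warmup`, are assumptions rather than the released training code.

```python
# Sketch of the reported Transformer training setup: AdamW, lr 1e-4, linear
# decay with a 10% warm-up, 10 epochs, and the mixed objective
# L = (1 - alpha) * L_task + alpha * L_retrieval with alpha = 0.1.
# The model/loss interfaces below are assumptions, not the released code.
import torch
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup


def train(model, train_loader: DataLoader, task_loss_fn, retrieval_loss_fn,
          epochs: int = 10, lr: float = 1e-4, alpha: float = 0.1):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    total_steps = epochs * len(train_loader)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),  # warm up over 10% of steps
        num_training_steps=total_steps,
    )
    for _ in range(epochs):
        for batch in train_loader:
            # Assumed forward interface: the model returns both task
            # predictions and retrieval (demultiplexing) predictions.
            task_out, retrieval_out = model(batch)
            loss = ((1 - alpha) * task_loss_fn(task_out, batch)
                    + alpha * retrieval_loss_fn(retrieval_out, batch))
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```

The batch size of 32 quoted above would be set when constructing `train_loader`.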