Privacy-Preserving Video Classification with Convolutional Neural Networks
Authors: Sikha Pentyala, Rafael Dowsley, Martine De Cock
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposed solution in an application for private human emotion recognition. Our results across a variety of security settings, spanning honest and dishonest majority configurations of the computing parties, and for both passive and active adversaries, demonstrate that videos can be classified with state-of-the-art accuracy, and without leaking sensitive user information. |
| Researcher Affiliation | Academia | (1) School of Engineering and Technology, University of Washington, Tacoma, WA, USA; (2) Faculty of Information Technology, Monash University, Clayton, Australia; (3) Dept. of Appl. Math., Computer Science and Statistics, Ghent University, Ghent, Belgium. |
| Pseudocode | Yes | Protocol 1: Protocol πFSELECT for oblivious frame selection. Input: a secret shared 4D-array [[A]] of size N × h × w × c with the frames of a video; a secret shared frame selection matrix [[B]] of size n × N. The values N, h, w, c, n are known to all parties. Output: a secret shared 4D-array [[F]] of size n × h × w × c holding the selected frames. (A plaintext sketch of this selection step is given after the table.) |
| Open Source Code | No | The paper states 'We implemented the protocols from Sec. 4 in the MPC framework MP-SPDZ (Keller, 2020)' but does not provide a link or repository giving access to their specific implementation of the described methodology. |
| Open Datasets | Yes | We use 1,248 video-only files with speech modality from this dataset, corresponding to 7 different emotions, namely neutral (96), happy (192), sad (192), angry (192), fearful (192), disgust (192), and surprised (192). The videos in the RAVDESS dataset have a duration of 3-5 seconds with 30 frames per second, hence the total number of frames per video is in the range of 120-150. We split the data into 1,116 videos for training and 132 videos for testing. |
| Dataset Splits | No | The paper states 'We split the data into 1,116 videos for training and 132 videos for testing' and describes how the test set was formed. While it mentions 'early-stopping', which implies the use of a validation set, it does not provide specific split information (percentages, counts, or an explicit standard split) for a validation set. |
| Hardware Specification | Yes | We implemented the protocols from Sec. 4 in the MPC framework MP-SPDZ (Keller, 2020), and ran experiments on co-located F32s v2 Azure virtual machines. Each of the parties (servers) ran on separate VM instances (connected with a Gigabit Ethernet network), which means that the results in the tables cover communication time in addition to computation time. An F32s v2 virtual machine contains 32 cores, 64 GiB of memory, and network bandwidth of up to 14 Gb/s. |
| Software Dependencies | No | The paper mentions using 'the MPC framework MP-SPDZ (Keller, 2020)', 'OpenCV (Bradski & Kaehler, 2008)', and 'Keras (Chollet et al., 2015)' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Our video classifier samples every 15th frame, classifies it with the above ConvNet, and assigns as the final class label the label that has the highest average probability across all frames in the video. ... For Bob's image classification model, we trained a ConvNet with 1.48 million parameters with an architecture of [(CONV-RELU)-POOL]-[(CONV-RELU)*2-POOL]*2-[FC-RELU]*2-[FC-SOFTMAX]. We pre-trained the feature layers on the FER2013 data to learn to extract facial features for emotion recognition, and fine-tuned the model on the RAVDESS training data. ... With early stopping using a batch size of 256 and Adam optimizer with default parameters in Keras (Chollet et al., 2015). ... With early stopping using a batch size of 64 and SGD optimizer with a learning rate of 0.001, decay of 10^-6, and momentum of 0.9. ... For the ring Z_{2^k}, we used value k = 64. (A Keras sketch of this architecture and the frame-averaging video classifier is given after the table.) |
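The oblivious frame selection in Protocol πFSELECT reduces to multiplying a one-hot frame selection matrix B with the flattened video tensor A; in the actual protocol both operands are secret shared and the product is computed with a secure matrix-multiplication protocol inside MP-SPDZ. The plaintext NumPy sketch below only illustrates that arithmetic, not the MPC execution; the function and variable names (`one_hot_selection`, `select_frames`) are ours, not from the paper or MP-SPDZ.

```python
import numpy as np

def one_hot_selection(frame_indices, N):
    """Build the n x N selection matrix B: row i is one-hot at frame_indices[i]."""
    B = np.zeros((len(frame_indices), N))
    B[np.arange(len(frame_indices)), frame_indices] = 1.0
    return B

def select_frames(A, B):
    """Select frames via matrix multiplication (plaintext stand-in for pi_FSELECT).

    A: video as an (N, h, w, c) array.
    B: (n, N) one-hot selection matrix.
    Returns an (n, h, w, c) array holding the selected frames.
    """
    N, h, w, c = A.shape
    flat = A.reshape(N, h * w * c)   # flatten each frame to a row vector
    selected = B @ flat              # (n, N) x (N, h*w*c) -> (n, h*w*c)
    return selected.reshape(B.shape[0], h, w, c)

# Example: sample every 15th frame of a 120-frame video of 48x48 grayscale images.
A = np.random.rand(120, 48, 48, 1)
idx = np.arange(0, 120, 15)
F = select_frames(A, one_hot_selection(idx, 120))
assert F.shape == (8, 48, 48, 1)
```

In the secret-shared setting the server learns neither which frames were picked (B is hidden) nor the pixel values (A is hidden), which is why selection is phrased as a matrix product rather than plain indexing.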
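For the experiment-setup row, here is a minimal Keras sketch of a ConvNet following the quoted layer pattern, the quoted SGD fine-tuning settings, and the every-15th-frame averaging classifier. The input size (48×48 grayscale, as in FER2013), filter counts, kernel sizes, and dense-layer widths are assumptions chosen only to match the stated pattern and the rough 1.48M-parameter budget; the paper excerpt does not spell them out.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_emotion_convnet(input_shape=(48, 48, 1), num_classes=7):
    """[(CONV-RELU)-POOL]-[(CONV-RELU)*2-POOL]*2-[FC-RELU]*2-[FC-SOFTMAX].

    Filter counts, kernel sizes, and dense widths are assumptions; only the
    layer pattern and the 7 emotion classes come from the paper.
    """
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_emotion_convnet()
# Fine-tuning configuration quoted in the table: SGD with learning rate 0.001
# and momentum 0.9 (batch size 64, early stopping). The quoted decay of 1e-6
# was passed as decay=1e-6 in the Keras versions of that era.
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

def classify_video(model, frames):
    """Sample every 15th frame, classify each, and return the label with the
    highest average probability across the sampled frames."""
    sampled = frames[::15]
    probs = model.predict(sampled, verbose=0)
    return int(np.argmax(probs.mean(axis=0)))
```

With the assumed 48×48 input, this layout comes to roughly 1.49 million parameters, close to the 1.48 million quoted, but the exact hyperparameters used by the authors are not given in the excerpt.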