
Hardening Deep Neural Networks via Adversarial Model Cascades

Deep neural networks (DNNs) are vulnerable to malicious inputs crafted by an adversary to produce erroneous outputs. Works on securing neural networks against adversarial examples achieve high empirical robustness on simple datasets such as MNIST. However, these techniques are inadequate when empirically tested on complex datasets such as CIFAR10 and SVHN. Further, existing techniques are designed to target specific attacks and fail to generalize across attacks. We propose Adversarial Model Cascades (AMC) as a way to tackle the above inadequacies. Our approach trains a cascade of models sequentially, where each model is optimized to be robust towards a mixture of multiple attacks. Ultimately, it yields a single model which is secure against a wide range of attacks; namely FGSM, Elastic, Virtual Adversarial Perturbations and Madry (PGM). On average, AMC increases the model’s empirical robustness against various attacks simultaneously, by a significant margin (6.225% for MNIST, 5.075% for SVHN and 2.65% for CIFAR10). At the same time, the model’s performance on non-adversarial inputs is comparable to that of state-of-the-art models.

Threat Model

Consider a deep neural network f(.;θ) trained on samples drawn from an unknown data distribution D, and let D’ be a proxy distribution for D. Consider a clean sample (x, y) ~ D’, i.e., one without any adversarial noise. An adversary tries to create a malicious sample x_adv by adding a small perturbation to x such that x and x_adv are close according to some distance metric, yet the network misclassifies x_adv.

For our analyses, we consider two threat models: white-box and adaptive black-box. A white-box adversary has access to the target model’s weights and the training dataset. An adaptive black-box adversary interacts with the target model only through its prediction interface: it trains a proxy model on samples from D’, obtaining labels by querying the target model f(.;θ). The adversary then crafts adversarial examples x_adv on the proxy model using white-box attack strategies and uses these malicious examples to try to fool the target model f(.;θ).
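To make the adaptive black-box setting concrete, here is a minimal PyTorch sketch of training such a proxy by querying the target’s prediction interface. The names `proxy`, `proxy_loader` (data drawn from D’) and `target_predict` are illustrative placeholders, not artifacts from the paper.

```python
# Minimal sketch of the adaptive black-box setup, assuming PyTorch.
# `proxy`, `proxy_loader` and `target_predict` are illustrative placeholders.
import torch
import torch.nn.functional as F

def train_proxy(proxy, proxy_loader, target_predict, epochs=10, lr=1e-3):
    """Train a local proxy model on D', labelling samples by querying the target."""
    opt = torch.optim.Adam(proxy.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in proxy_loader:                            # ground-truth labels are unused
            with torch.no_grad():
                y_target = target_predict(x).argmax(dim=1)   # query the target model
            opt.zero_grad()
            loss = F.cross_entropy(proxy(x), y_target)
            loss.backward()
            opt.step()
    return proxy
```

Labelling the proxy’s training data with the target’s own predictions, rather than ground-truth labels, is what makes this black-box adversary “adaptive”.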

We consider four attacks in our research: FGSM (Fast Gradient Sign Method), VAP (Virtual Adversarial Perturbation), EAP (the Elastic perturbation attack) and PGM (the Projected Gradient Method, i.e., the Madry attack).
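As a representative example, below is a minimal PyTorch sketch of FGSM, the simplest of the four; `model` is any differentiable classifier and the value of `epsilon` (the perturbation budget) is illustrative.

```python
# Minimal FGSM sketch: a single signed-gradient step of size epsilon,
# followed by clipping back to the valid input range [0, 1].
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=0.03):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + epsilon * x_adv.grad.sign()   # step in the direction that increases the loss
    return x_adv.clamp(0.0, 1.0).detach()
```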

AMC: Adversarial Model Cascades

Inspired by the observation that adversarial examples transfer between defended models, we propose Adversarial Model Cascades (AMC): training a cascade of models by injecting examples crafted from a local proxy model (or the target model itself). The cascade trains a stack of models built sequentially, where each model in the cascade is more robust than the one before. The key principle of our approach is that each model in the cascade is optimized to be secure against a combination of the attacks under consideration, along with the attacks the model encountered during the previous iterations. Knowledge from the previous model is leveraged via parameter transfer while securing the model against subsequent attacks. This increases the robustness of the next level of the cascade, ultimately yielding a model which is robust to all the attacks it has been hardened against.

A high level overview of the AMC algorithm.
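To sketch the cascade structure in code (and only the structure; optimizers, schedules and crafting details follow the paper), a high-level Python rendering might look as follows, where `make_model`, `clean_pretrain`, `adversarial_train` and `attacks` are illustrative placeholders.

```python
# High-level sketch of the AMC cascade: models are hardened sequentially,
# one new attack per level, with parameters carried over between levels.
# All names below are illustrative placeholders, not the paper's code.
import copy

def amc_cascade(make_model, attacks, train_loader, clean_pretrain, adversarial_train):
    model = make_model()
    clean_pretrain(model, train_loader)        # start from a normally trained model
    for level in range(len(attacks)):
        # Parameter transfer: each level starts from the previous model's weights.
        model = copy.deepcopy(model)
        # Harden against the attack introduced at this level, together with a
        # weighted mix of the attacks seen so far (see the modified loss below).
        # Adversarial examples may be crafted on the model itself or on a local proxy.
        adversarial_train(model, train_loader, attacks[: level + 1])
    return model
```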


Although AMC (the version described above) achieves a significant increase in robustness against many attacks, it does not guarantee an increase in robustness against all of the attacks it has seen in the past. This is especially true for attacks seen during the initial iterations of the algorithm. To mitigate this problem, during the later iterations we also need to make the model remember the adversarial examples generated by the attacks it saw during its initial iterations. Thus, while constructing the adversarial data for each batch, instead of generating all perturbed data using the current attack, the algorithm also uses the attacks it has seen so far in its run. This process implicitly weighs the attacks, as the model ends up seeing more samples from the attacks it encounters during earlier iterations. As a result, the loss function used to compute the gradient (at a given level E of the cascade) becomes:

Modified loss function for AMC
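The loss itself appears only as an image in the original post; a plausible rendering at cascade level E, assuming the weighted combination of per-attack adversarial losses described above (the paper’s exact notation may differ), is

\[
J_E(\theta) \;=\; \sum_{i=1}^{E} \lambda_i \, L\!\left(f\!\left(x_{adv}^{A_i}; \theta\right),\, y\right),
\]

with x_adv^{A_i} denoting the input x perturbed by attack A_i and L the classification loss.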

Here the λ_i are hyper-parameters such that ∀i, λ_i ∈ [0, 1] and Σ_i λ_i = 1. With the above scheme, we observe that the resultant model remembers adversarial examples from attacks introduced during the initial iterations as well as the recent ones, thus yielding better overall robustness.
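A compact sketch of how this weighted, multi-attack loss could be computed per batch is given below; `attacks` is the list of attack functions encountered so far (e.g., `fgsm` above) and `lambdas` the corresponding weights. In practice the batch can instead be partitioned across attacks in proportion to λ_i; the version here simply mirrors the formula.

```python
# Minimal sketch of the per-batch AMC loss at a given cascade level: a
# lambda-weighted combination of the adversarial losses of all attacks
# seen so far. Names are illustrative; see the paper for exact details.
import torch
import torch.nn.functional as F

def mixed_adversarial_loss(model, x, y, attacks, lambdas):
    losses = []
    for attack, lam in zip(attacks, lambdas):   # attacks seen so far at this level
        x_adv = attack(model, x, y)             # craft examples with attack A_i
        model.zero_grad()                       # discard gradients accumulated while crafting
        losses.append(lam * F.cross_entropy(model(x_adv), y))
    return torch.stack(losses).sum()            # the lambda_i sum to 1
```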

Feature Squeezing

In addition to the above, we also investigate the suitability of Feature Squeezing as a pre-processing step in the pipeline for improving the empirical robustness of DNNs. Although Feature Squeezing on its own can be bypassed, when used in conjunction with our framework we observe that Feature Squeezing (quantization in particular) can effectively improve accuracy against stronger attacks.
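For reference, the quantization variant of Feature Squeezing amounts to reducing the colour bit depth of the input before classification; a minimal sketch (with an illustrative bit depth) is shown below.

```python
# Minimal sketch of Feature Squeezing via bit-depth reduction (quantization):
# inputs in [0, 1] are rounded to 2**bits discrete levels before being fed
# to the hardened model. The bit depth shown is illustrative.
import torch

def squeeze_bit_depth(x, bits=4):
    levels = 2 ** bits - 1
    return torch.round(x * levels) / levels

# At inference time: predictions = model(squeeze_bit_depth(images))
```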

Experimental Results

Error rates for various defenses against white-box adversarial attacks.
Error rates for various defenses against black-box adversarial attacks.

We observe that AMC, on average, gives higher empirical robustness (accuracy on adversarial examples) than the other defenses we evaluate. Average robustness here refers to the robustness against all four attacks averaged together. We also observed that models obtained via adversarial hardening against only one kind of attack did not improve robustness against other attacks, whereas our models are robust against all the attacks we considered. Additionally, we outperform the technique of Wong et al. by a margin of 2% for MNIST, 12% for SVHN and 7% for CIFAR-10, on average across attacks in the white-box setting.

Comparison with Ensemble Adversarial Training (EAT)

Ensemble Adversarial Training (EAT) also adopts an ensemble-based technique for increasing robustness. A trivial extension of EAT for handling multiple attacks would be to create examples augmented using multiple algorithms (or attacks). Such a technique has a critical shortcoming: the model overfits to the perturbations introduced during training. In turn, EAT only provides a marginal increase in the robustness of the model across attacks in comparison to an unhardened model, and hence gives a false sense of security.
In our experiments, we observe that our approach improves robustness against attacks more effectively than EAT. In particular, on average across attacks, we see an increase of 75% for MNIST, 77% for SVHN and 57% for CIFAR-10 in the white-box setup, and an increase of 14.5% for MNIST, 36.25% for SVHN and 7.8% for CIFAR-10 in the black-box setup.

Benefits of using AMC

To the best of our knowledge, ours is the first attempt to provide an end-to-end pipeline that improves robustness against multiple adversarial attacks and multiple kinds of adversaries simultaneously, in both white-box and black-box settings.

  • AMC provides robustness against all the attacks it is hardened against, making it an all-in-one defense mechanism against several attacks.
  • Even though Feature Squeezing on its own has been shown to be bypassable, when used in conjunction with our framework, we observe that Feature Squeezing (quantization in particular) can effectively improve robustness against both attacks seen during training and stronger, unseen attacks. On average, we observe an absolute increase of 2.65% for MNIST, 8.9% for SVHN and 8.8% for CIFAR-10 for seen attacks, and an absolute increase of 68.9% for MNIST, 56.25% for SVHN and 58.2% for CIFAR-10 for stronger, unseen attacks, on top of AMC.
  • There is no overhead at inference time. Thus, services can deploy versions of the models hardened with AMC without any compromise on latency. As opposed to most existing adversarial defense methods, our algorithm does not compromise on performance on unperturbed data.

Please see the full paper, accepted at IJCNN 2019, for a detailed description of our work. This is joint work with Deepak Vijaykeerthy (IBM Research Labs), Sameep Mehta (IBM Research Labs), and PK.
