Abstract
Graphical abstract
Keywords
Introduction
Theoretical considerations
Methods for robuster explanations
Experimental analysis
Conclusion
Declaration of Competing Interest
Acknowledgments
Appendix A. Proof of Theorem 1
Appendix B. ReLU networks
Appendix C. Interchangeability of softplus
Appendix D. Experimental analysis
Appendix E. Hessian norm approximation
Appendix F. Additional network structures and data sets
Appendix G. Targeted adversarial attacks
Appendix H. Accuracy-robustness tradeoff
References
Abstract
Explanation methods shed light on the decision process of black-box classifiers such as deep neural networks. However, their usefulness can be compromised because they are susceptible to manipulation. With this work, we aim to enhance the resilience of explanations. We develop a unified theoretical framework for deriving bounds on the maximal manipulability of a model. Based on these theoretical insights, we present three techniques to boost robustness against manipulation: training with weight decay, smoothing the activation functions, and minimizing the Hessian of the network. Our experimental results confirm the effectiveness of these approaches.
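The three techniques named above can be sketched in a few lines of PyTorch. The snippet below is an illustrative sketch, not the implementation used in this work: the architecture, the softplus parameter beta, the weight-decay coefficient, the penalty weight, and the Hutchinson-style estimator of the input-Hessian norm are all assumptions chosen for clarity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothNet(nn.Module):
    # Small classifier that uses softplus everywhere a ReLU would normally sit.
    def __init__(self, in_dim=784, hidden=256, n_classes=10, beta=10.0):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, n_classes)
        self.beta = beta  # larger beta brings softplus closer to ReLU

    def forward(self, x):
        x = F.softplus(self.fc1(x), beta=self.beta)
        x = F.softplus(self.fc2(x), beta=self.beta)
        return self.fc3(x)

def input_hessian_penalty(model, x, y, n_samples=1):
    # Stochastic estimate of the squared Frobenius norm of the Hessian of the
    # loss with respect to the input: E_v ||H v||^2 with random +/-1 vectors v,
    # computed via double backpropagation (Hutchinson-style estimator).
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad_x = torch.autograd.grad(loss, x, create_graph=True)[0]
    penalty = x.new_zeros(())
    for _ in range(n_samples):
        v = torch.randint_like(x, high=2) * 2.0 - 1.0  # Rademacher noise
        hv = torch.autograd.grad((grad_x * v).sum(), x, create_graph=True)[0]
        penalty = penalty + (hv ** 2).sum()
    return penalty / n_samples

model = SmoothNet()
# Weight decay is applied directly through the optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

def train_step(x, y, hessian_weight=1e-3):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss = loss + hessian_weight * input_hessian_penalty(model, x, y)
    loss.backward()
    optimizer.step()
    return loss.item()

In this sketch, weight decay enters through the optimizer, the softplus activations replace ReLU so that the network has a well-defined input Hessian, and the penalty term approximates the Frobenius norm of that Hessian via random Hessian-vector products.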
Introduction
In recent years, deep neural networks have revolutionized many fields. Despite their impressive performance, the reasoning behind their decisions remains difficult for humans to grasp, which can limit their usefulness in applications that require transparency. Explanation methods promise to make neural networks interpretable.