Activation functions in deep neural networks
Abstract
The ability of machine learning algorithms to generalize is arguably their most important aspect, as it determines their ability to perform appropriately on unseen data. The impressive generalization abilities of deep neural networks (DNNs) are not yet well understood. In particular, the influence of activation functions on the learning process has received limited theoretical attention, even though phenomena such as vanishing gradients, node saturation and sparsity have been identified as possible contributors when comparing different activation functions. In this study, we present findings based on a comparison of several DNN architectures trained with two popular activation functions, and we investigate the effect of these activation functions on training and generalization. We aim to determine the principal factors that contribute to the superior generalization performance of rectified linear (ReLU) networks compared with sigmoidal networks. We investigate these factors using fully-connected feedforward networks trained on three standard benchmark tasks. We find that the most salient differences between networks trained with these activation functions relate to the way in which class-distinctive information is separated and propagated through the network. We find that the behavior of nodes in ReLU and sigmoidal networks shows similar regularities in some cases. We also find relationships between the ability of hidden layers to accurately use the information available to them and the capacity (specifically the depth and width) of the models. The study contributes to open questions regarding the generalization performance of deep neural networks, specifically by giving an informed perspective on the role of two historically popular activation functions.
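To make the comparison concrete, the sketch below defines the two activation functions discussed in the abstract and their derivatives, illustrating the saturation and sparsity phenomena mentioned above. It is a minimal NumPy illustration, not code from the study itself; the function names and sample inputs are chosen for exposition only.

```python
import numpy as np

# The two activation functions compared in the study.
def sigmoid(z):
    # Saturates toward 0 or 1 for large |z|, which shrinks gradients.
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Outputs exactly zero for negative inputs (sparse activations)
    # and passes positive inputs through unchanged.
    return np.maximum(0.0, z)

# Derivatives, illustrating why saturation matters during backpropagation.
def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # at most 0.25, near 0 when |z| is large

def relu_grad(z):
    return (z > 0).astype(float)  # 1 for active nodes, 0 otherwise

z = np.linspace(-6.0, 6.0, 7)
print(sigmoid_grad(z))  # small values at the extremes (node saturation)
print(relu_grad(z))     # zero for all negative inputs (sparsity)
```

Under this view, vanishing gradients arise when many sigmoidal nodes operate in their saturated regions, whereas ReLU nodes either pass gradients unchanged or block them entirely, producing the sparse activation patterns contrasted in the study.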
Collections
- Engineering