Activation functions in deep neural networks
The ability of machine learning algorithms to generalize is arguably their most important aspect, as it determines their performance on unseen data. The impressive generalization abilities of deep neural networks (DNNs) are not yet well understood. In particular, the influence of activation functions on the learning process has received limited theoretical attention, even though phenomena such as vanishing gradients, node saturation and sparsity have been identified as possible contributors when comparing different activation functions. In this study, we present findings based on a comparison of several DNN architectures trained with two popular activation functions, and investigate the effect of these activation functions on training and generalization. We aim to determine the principal factors that contribute to the superior generalization performance of rectified linear (ReLU) networks compared with sigmoidal networks. We investigate these factors using fully-connected feedforward networks trained on three standard benchmark tasks. We find that the most salient differences between networks trained with these activation functions relate to the way in which class-distinctive information is separated and propagated through the network. We find that the behavior of nodes in ReLU and sigmoidal networks shows similar regularities in some cases. We also find relationships between the ability of hidden layers to accurately use the information available to them and the capacity (specifically, the depth and width) of the models. The study contributes to open questions regarding the generalization performance of deep neural networks, specifically by giving an informed perspective on the role of two historically popular activation functions.
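The vanishing-gradient phenomenon mentioned above can be illustrated with a minimal sketch. This is a hypothetical toy setting, not the experimental setup of the study: we assume a chain of layers in which each node receives the same pre-activation value, and track how the local derivative of each activation function scales a backpropagated gradient with depth. The derivative of the sigmoid is at most 0.25, so the gradient shrinks geometrically; the derivative of ReLU is 1 for active nodes, so the gradient passes through unchanged.

```python
import math

def sigmoid(x):
    # Logistic sigmoid activation.
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), with maximum 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_deriv(x):
    # Derivative of ReLU: 1 for active nodes (x > 0), 0 otherwise.
    return 1.0 if x > 0 else 0.0

def grad_through_depth(deriv, x, depth):
    # Gradient magnitude after backpropagating through `depth` layers,
    # assuming every node sees pre-activation x and unit weights.
    g = 1.0
    for _ in range(depth):
        g *= deriv(x)
    return g

depth = 10
print(grad_through_depth(sigmoid_deriv, 0.0, depth))  # 0.25**10, about 9.5e-7
print(grad_through_depth(relu_deriv, 1.0, depth))     # 1.0
```

Even in this best case for the sigmoid (pre-activations at 0, where its derivative peaks), the gradient through ten layers is attenuated by roughly six orders of magnitude, while active ReLU paths preserve it, which is one of the candidate explanations the abstract refers to.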