Parametric studies of translation invariance and distortion robustness in Convolutional Neural Networks
Abstract
Although Convolutional Neural Networks (CNNs) are widely used, their translation invariance (ability to deal with translated inputs) is still subject to some controversy. We explore this question using translation-sensitivity maps to quantify how sensitive a standard CNN is to a translated input. We propose the use of cosine similarity as a sensitivity metric over Euclidean distance, and discuss the importance of restricting the dimensionality of either of these metrics when comparing architectures.

Our main focus is to investigate the effect of different architectural components of a standard CNN on that network's sensitivity to translation. To study the effect of max-pool kernel size on translation invariance, we train several CNN architectures with differently shaped max-pool kernels and compare their translation invariance. The results indicate that larger max-pool kernels result in more translation invariance than smaller ones. By varying convolutional kernel sizes and amounts of zero padding, we control the size of the feature maps produced, allowing us to quantify the extent to which these elements influence translation invariance. We also measure translation invariance at different locations within the CNN to determine the extent to which the convolutional and fully connected layers, respectively, contribute to the translation invariance of the CNN as a whole. Our analysis indicates that both convolutional kernel size and feature map size have a systematic influence on translation invariance. We also find that convolutional layers contribute less than expected to translation invariance when not specifically forced to do so.

The effects of various CNN components on distortion sensitivity are also analysed in this study, and we examine the differences between how CNNs deal with translation and distortion. Using distortion-sensitivity functions that we define, we are able to quantify how sensitive a network is to distorted inputs. The results indicate that larger max-pool kernels result in more distortion sensitivity for CNNs trained on MNIST, paralleling their effect on translation invariance. Convolutional kernel size has less of an effect on distortion sensitivity for CNNs trained on MNIST, while all networks, regardless of their architectural variations, learn to be less sensitive to distortion when trained on CIFAR10.

All in all, it seems that convolutional layers are not fully utilised to deal with spatial information if the training task is not difficult enough, forcing the (spatially unaware) fully connected layers to compensate by learning similar elements at different input locations. By reducing the feature map size to 1, thereby forcing the convolutional layers to better deal with translation, we obtain the most translation-invariant system studied, when evaluated on the unseen test set. We also observe a few similarities in how CNNs deal with translation and distortion, and attribute these similarities to the network's ability to pinpoint important features in general, rather than specifically to the movement of the kernel across the input during convolution.
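As a concrete illustration of the translation-sensitivity maps described above, the following is a minimal PyTorch sketch. It assumes a model whose forward pass returns the feature vector to be compared; the function name translation_sensitivity_map, the max_shift parameter, and the use of circular shifts via torch.roll are illustrative assumptions, not the thesis's exact procedure (which would more plausibly use shifted crops or zero-padded translations).

```python
import torch
import torch.nn.functional as F

def translation_sensitivity_map(model, image, max_shift=8):
    """Cosine-similarity-based translation-sensitivity map (sketch).

    For each (dy, dx) shift, compare the model's feature vector for the
    shifted image against the unshifted baseline. Values near 1 indicate
    translation invariance; lower values indicate sensitivity.
    """
    model.eval()
    with torch.no_grad():
        # Baseline features for the unshifted input; image is (C, H, W).
        base = model(image.unsqueeze(0)).flatten(1)
        size = 2 * max_shift + 1
        smap = torch.zeros(size, size)
        for i, dy in enumerate(range(-max_shift, max_shift + 1)):
            for j, dx in enumerate(range(-max_shift, max_shift + 1)):
                # Circular shift used here for simplicity; an assumption,
                # not necessarily the translation scheme used in the thesis.
                shifted = torch.roll(image, shifts=(dy, dx), dims=(-2, -1))
                feats = model(shifted.unsqueeze(0)).flatten(1)
                smap[i, j] = F.cosine_similarity(base, feats, dim=1).item()
    return smap
```

Plotting smap as a heat map over (dy, dx) yields the translation-sensitivity map; a Euclidean-distance variant would replace the cosine similarity, subject to the abstract's caveat about restricting dimensionality when comparing architectures.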