Generalization in deep learning: bilateral synergies in MLP learning
Abstract
We present an investigation of how simple artificial neural networks (specifically, feed-forward networks with full connections between each successive pair of layers) generalize to out-of-sample data. By emphasizing the substructures formed within these networks, we are able to shed light on several phenomena and relevant open questions in the literature. Specifically, we show that hidden units with piecewise linear activation functions are optimized on the training set in a distributed manner, meaning each sub-unit is optimized only to reduce the loss of a specific sub-population of the training set. This mechanism gives rise to a type of modularity that is seldom considered in investigations of artificial neural networks and generalization.
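The claim above, that with piecewise linear activations each hidden unit receives training signal only from the samples on which it is active, can be made concrete with a short sketch. The following PyTorch snippet is ours, not the thesis's code; the toy network, data, and seed are illustrative assumptions. It builds a small fully connected ReLU network, lists each hidden unit's active sub-population, and verifies that a sample contributes zero gradient to the incoming weights of units it does not activate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small fully connected network with a single ReLU hidden layer.
net = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
X = torch.randn(100, 2)   # toy inputs (illustrative, not the thesis's data)
y = torch.randn(100, 1)   # toy targets

# Which samples activate each hidden unit? Rows: samples, columns: units.
with torch.no_grad():
    active = net[0](X) > 0          # boolean mask of shape (100, 8)
for j in range(active.shape[1]):
    print(f"unit {j}: active on {int(active[:, j].sum())}/100 samples")

# Gradient check for one sample: the first-layer weight rows belonging to
# units that are inactive on that sample receive zero gradient from its
# loss, so each unit is effectively trained only on its own sub-population.
loss0 = ((net(X[:1]) - y[:1]) ** 2).sum()
grad0 = torch.autograd.grad(loss0, net[0].weight)[0]      # shape (8, 2)
print(torch.equal(grad0.abs().sum(dim=1) > 0, active[0])) # expect True
```

The mechanism is simply that the ReLU's derivative is zero on its inactive side, so a sample on which a unit is inactive passes no gradient back through that unit.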
We are able to uncover informative regularity in sub-unit behavior and to elucidate known phenomena, such as: different artificial neural networks tend to prioritize similar samples; over-parameterization does not necessarily lead to poor generalization; artificial neural networks are able to interpolate large amounts of noise and still generalize appropriately; and generalization error as a function of representational capacity undergoes a second descent beyond the interpolation point (the double descent phenomenon).
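As a rough illustration of how the double descent curve is measured, the sketch below sweeps the hidden width of an MLP past the interpolation threshold and records train and test error at each capacity. The synthetic regression task, widths, and training budget are our assumptions, not the thesis's experimental setup, and whether a pronounced second descent appears depends on the data, noise level, and training details.

```python
import torch
import torch.nn as nn

def fit_mlp(width, X_tr, y_tr, steps=2000, lr=1e-2):
    """Train a one-hidden-layer ReLU MLP of the given width."""
    net = nn.Sequential(nn.Linear(X_tr.shape[1], width), nn.ReLU(),
                        nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ((net(X_tr) - y_tr) ** 2).mean().backward()
        opt.step()
    return net

torch.manual_seed(0)
X_tr, X_te = torch.randn(50, 5), torch.randn(500, 5)
true_w = torch.randn(5, 1)
y_tr = X_tr @ true_w + 0.3 * torch.randn(50, 1)   # noisy training labels
y_te = X_te @ true_w                               # noiseless test targets

# Sweep representational capacity past the interpolation threshold.
for width in (2, 8, 32, 128, 512):
    net = fit_mlp(width, X_tr, y_tr)
    with torch.no_grad():
        tr = ((net(X_tr) - y_tr) ** 2).mean().item()
        te = ((net(X_te) - y_te) ** 2).mean().item()
    print(f"width={width:4d}  train MSE={tr:.4f}  test MSE={te:.4f}")
```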
We motivate a perspective on generalization in deep learning that is less focused on the complexity of hypothesis spaces and instead looks to substructures, and to the manner in which training data are compartmentalized, as a way of understanding the observed ability of these networks to generalize. Under certain conditions, this perspective contradicts classical ideas of generalization and complexity.