This talk presents a variational framework to understand the properties of functions learned by neural networks fit to data. The framework is based on total variation semi-norms defined in the Radon domain, which is naturally suited to the analysis of neural activation functions (ridge functions). Finding a function that fits a dataset while having a small semi-norm is posed as an infinite dimensional variational optimization. We derive a representer theorem showing that finite-width neural networks are solutions to the variational problem. The representer theorem is reminiscent of the classical reproducing kernel Hilbert space representer theorem, but we show that neural networks are solutions in a non-Hilbertian Banach space. While the learning problems are posed in an infinite dimensional function space, similar to kernel methods, they can be recast as finite-dimensional neural network training problems. These neural network training problems have regularizers which are related to the well-known weight decay and path-norm regularizers. Thus, the results provide new insight into functional characteristics of overparameterized neural networks and also into the design neural network regularizers. Our results also provide new theoretical support for a number of empirical findings in deep learning architectures including the benefits of “skip connections”, sparsity, and low-rank structures.
This is joint work with Rahul Parhi.