Weight Agnostic Neural Networks, by Adam Gaier and David Ha, was hailed as a seminal finding in Deep Learning: it showed that the architecture of a neural network may matter as much as actually training the network's weights via gradient descent. The authors see this finding as consistent with the observation that animals tend to be born with complex neural architectures that, prior to any learning, already support complex behaviors such as swimming or recognizing predators.
The methodology was to use an evolutionary algorithm (based on Kenneth Stanley's NeuroEvolution of Augmenting Topologies, or "NEAT", algorithm) to build neural architectures in which a single random weight is shared across all connections. Despite the shared random weight, many of these architectures ended up solving some fairly complex Machine Learning problems with fair results, including MNIST (recognizing handwritten digits), Bipedal Walker, Car Racing, and Swing-Up Cart-Pole. For example, the MNIST results were 92% accuracy. (For comparison, in my Deep Learning class I managed to get 97% accuracy using stochastic gradient descent with a two-layer neural network.) Remember, this is a network using only a single random shared weight.
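To make the evaluation protocol concrete, here is a minimal sketch (not the authors' code) of how one fixed topology can be scored in a weight-agnostic way: every connection uses the same shared weight, and the topology is rated by its mean performance over a small grid of weight values rather than by any single tuned weight. The topology and toy task below are my own illustrations; the weight grid resembles the range used in the paper.

```python
import numpy as np

def forward(x, w):
    # Tiny hand-built topology: every connection uses the same shared weight w.
    # The sign pattern (+/-) comes from the topology itself, not from learned weights.
    h1 = np.tanh(w * x[0] + w * x[1])
    h2 = np.tanh(w * x[0] - w * x[1])
    return w * h1 + w * h2

# Toy task (purely illustrative): classify points by the sign of x0.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] > 0).astype(int)

shared_weights = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]  # grid of shared weight values
scores = []
for w in shared_weights:
    preds = np.array([forward(x, w) > 0 for x in X]).astype(int)
    scores.append((preds == y).mean())

# A weight-agnostic topology is ranked by its *mean* score over all
# shared weight values, not by its best single-weight score.
print(f"mean accuracy over shared weights: {np.mean(scores):.2f}")
```

Note that this particular topology encodes the task so directly that it classifies correctly for every weight in the grid, which is exactly the property the architecture search selects for.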
These results strongly suggest that we have underestimated the role of the network architecture itself in Deep Learning. The authors of the paper note that we already knew some architectures have a strong inductive bias toward certain kinds of problems; obvious examples are Convolutional Neural Networks for visual problems and Recurrent Neural Networks for time series problems. But it appears that some architectures have such a strong inductive bias toward certain problems that they can solve them even with a random weight. If you then train this network with gradient descent, you can obtain state-of-the-art results in some cases.
While I think this paper is a stunning result that challenges how we'll look at Deep Learning, I felt that the use of a single shared weight for all connections reduces its impact. The authors found that they could not replicate the results with a separate random weight per connection: networks often relied on the sign of the shared weight, so assigning a random weight per connection destroyed their effectiveness. It's not hard to see that if a network uses a single random weight, then an architecture will evolve that treats this value as a sort of seed and builds its logic around it. All the intelligence then comes from the evolutionary algorithm that constructs the network around that weight.
Where I think this result is a bit more revolutionary is that the resulting networks are so small compared to existing Deep Learning architectures. Deep Learning seems to rely on the fact that the number of weights in the network is often much larger than the number of examples. Zhang et al. found that Deep Neural Networks have the capacity to simply memorize the training set, reducing training error to zero even on randomly assigned labels. (See Zhang et al.'s "Understanding Deep Learning Requires Rethinking Generalization" for discussion.) This does not seem possible with the incredibly terse networks that Gaier and Ha's approach creates. That these networks can then be trained with gradient descent to near state-of-the-art results suggests that so much of the function's logic is contained in the network architecture (just as the authors claim) that much of the time spent training Deep Neural Networks is probably wasted time and memory.
The final networks chosen by the evolutionary algorithm favored smaller networks over larger ones, a preference motivated by Kolmogorov complexity: "The Kolmogorov complexity of a computable object is the minimum length of the program that can compute it." (p. 2) The end result is network architectures much smaller than a stereotypical feedforward neural network (or any other currently popular architecture) that, when trained, have near state-of-the-art performance, at least on some of the problems tried in the paper. This does seem like a significant breakthrough in how we will understand neural networks.
Today we essentially use neural networks as giant random functions whose weights we tweak until they approximate the desired function. Because these functions have so many weights/parameters, there is almost always some set of weights that gives an acceptable result. This paper suggests that approach is wasteful. Instead, one could imagine searching for a minimal architecture first and then quickly training the final weights.
However, does this suggest that architecture and weights have equivalent representational power? It would appear not. An architecture search on its own does not seem to yield state-of-the-art results; a final weight search is still required. So you can get state-of-the-art results in one of two ways: 1) via weight search alone, or 2) via architecture search combined with weight search. This seems to leave no doubt that weight search is still the most important aspect of neural networks, and that the real representational power of Artificial Neural Networks lies there.
As Zhang and company found, we still do not really understand how Neural Networks generalize. Neural Networks clearly have the capacity to memorize the dataset, so why do they so often generalize into a useful function instead? Zhang's results on random labels refute most of the conventional wisdom about why neural networks generalize, so there is still substantial new research possible to answer these questions.