Review of “Weight Agnostic Neural Networks”

Weight Agnostic Neural Networks, by Adam Gaier and David Ha, was hailed as a seminal finding in Deep Learning: it showed that the architecture of a neural network may matter as much as training its weights via gradient descent. The authors see this finding as consistent with the observation that animals tend to be born with complex neural architectures that, prior to any learning, already allow them to perform complex tasks such as swimming or recognizing predators.

The methodology was to use an evolutionary algorithm (based on Kenneth Stanley's NeuroEvolution of Augmenting Topologies, or "NEAT", algorithm) to build out neural architectures in which a single random weight is shared across all connections. Despite using only that one random weight, many of these architectures ended up solving some fairly complex Machine Learning problems with fair results. This included successes with MNIST (recognizing handwritten digits), Bipedal Walker, Car Racing, and Swing-Up Cart Pole. For example, the MNIST results were 92% accuracy. (For comparison, in my Deep Learning class I managed to get 97% accuracy using stochastic gradient descent with a two-layer neural network.) Remember, this is a network using only a single random weight.
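To make that concrete, here is a minimal sketch of the evaluation idea (my own illustration, not code from the paper): a candidate topology, represented here as simple 0/1 connectivity masks with an activation per layer, is run with several fixed shared-weight values and scored by its average performance, so the fitness reflects the wiring itself rather than any tuned weight. The helper names and the mask representation are my simplifications.

```python
import numpy as np

# A minimal sketch of shared-weight evaluation (illustration only, not the
# authors' code). A "topology" is a list of (mask, activation_name) pairs,
# where mask is a 0/1 connectivity matrix; every live connection uses the
# SAME shared weight value.

ACTIVATIONS = {"relu": lambda x: np.maximum(0.0, x),
               "tanh": np.tanh,
               "sin": np.sin}

def forward(x, topology, shared_w):
    """Run inputs through the topology with one shared weight everywhere."""
    h = x
    for mask, act_name in topology:
        h = ACTIVATIONS[act_name](h @ (mask * shared_w))
    return h

def score_architecture(topology, X, y,
                       shared_weights=(-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)):
    """Average accuracy over several fixed shared-weight values, so the score
    reflects the architecture rather than any particular tuned weight."""
    accs = []
    for w in shared_weights:
        preds = forward(X, topology, w).argmax(axis=1)
        accs.append((preds == y).mean())
    return float(np.mean(accs))
```

A NEAT-style search would then call something like score_architecture on each candidate topology it proposes and keep the better-scoring (and simpler) ones.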

These results strongly suggest that we've underestimated the role of the network architecture itself in Deep Learning. The authors argue that we already knew some architectures have a strong inductive bias towards certain kinds of problems; obvious examples are Convolutional Neural Networks for visual problems and Recurrent Neural Networks for time-series problems. But it would appear that some architectures have such strong inductive biases that, even with a random weight, they can already solve certain problems. If you then train such a network with gradient descent, you can even obtain state-of-the-art results in some cases.

While I think this paper is a stunning result that challenges how we'll look at Deep Learning, I felt that the use of a single weight for all connections reduces the impact of the result. The authors found that they could not replicate the results using a separate random weight per connection. They note that the networks often relied on the sign of the shared weight to obtain their results, so assigning a different random weight to each connection destroyed the network's effectiveness. It's not too hard to see why: if a single random weight is used for the whole network, an architecture will evolve that simply treats that value as a sort of seed and builds logical structures around it. The intelligence then comes from the evolutionary algorithm that constructs the network around that weight.

Where I think this result is a bit more revolutionary is that the resulting networks are so small compared to existing Deep Learning architectures. Deep Learning seems to rely on the fact that the number of weights in the network is often much larger than the number of examples. Zhang et al. found that Deep Neural Networks simply have the capacity to memorize the training set and so can reduce training error to zero even on randomly assigned labels. (See Zhang et al.'s "Understanding Deep Learning Requires Rethinking Generalization" for discussion.) This does not seem possible with the incredibly terse networks that Gaier and Ha's approach creates. That these networks could then be trained with gradient descent to near state-of-the-art results suggests that so much of the logic of the function is contained in the network architecture (just as the authors claim) that much of the time spent training Deep Neural Networks is probably wasted time and memory.

The final networks chosen by the evolutionary algorithm favored smaller architectures over larger ones, in the spirit of minimizing description length: "The Kolmogorov complexity of a computable object is the minimum length of the program that can compute it." (p. 2) The end result is network architectures that are much smaller than a stereotypical feedforward network (or any other currently popular architecture) and that, when trained, reach near state-of-the-art performance, at least on some of the problems tried in the paper. This does seem like a significant breakthrough in how we will understand neural networks.
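As a rough illustration of how that preference for smaller networks might enter the search (building on the sketch above; the paper, as I read it, ranks candidates on multiple criteria including connection count), one could rank candidates by mean shared-weight accuracy and break ties in favor of fewer connections:

```python
def rank_candidates(candidates, X, y):
    """Sort candidate topologies by mean shared-weight accuracy (descending),
    breaking ties in favor of fewer connections, as a crude stand-in for the
    lower-complexity / shorter-description preference discussed above."""
    def sort_key(topology):
        n_connections = sum(int(mask.sum()) for mask, _ in topology)
        return (-score_architecture(topology, X, y), n_connections)
    return sorted(candidates, key=sort_key)
```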

We essentially use neural networks today as giant, randomly initialized functions whose weights we tweak until they approximate the desired function. Since these functions have so many weights/parameters, there is almost always some set of weights that gives an acceptable result. This paper suggests that approach is wasteful. Instead, one could imagine searching for a minimal architecture first and then quickly training the final weights.

However, does this suggest that architecture and weights have equivalent representational power? It would appear not. An architecture search on its own does not seem to give state-of-the-art results; it still requires a final weight search. This suggests you can reach state-of-the-art results in one of two ways: 1) via weight search alone, or 2) via architecture search combined with weight search. That seems to leave little doubt that weight search is still the most important aspect of neural networks and is where the real representational power of Artificial Neural Networks lies.

As Zhang and company found, we still do not really understand how Neural Networks generalize. Neural Networks clearly have the ability to simply memorize the dataset, so why do they so often generalize into a useful function instead? Zhang's results on random labels refute most of the conventional wisdom as to why neural networks generalize, so there is still substantial new research to be done to answer these questions.

2 Replies to “Review of “Weight Agnostic Neural Networks””

  1. Really neat paper, Bruce. I’ve seen similar works from a long time ago but nothing recently. I have a few comments:

    — Note that a deep neural network trained with SGD can discover many useful subnetworks by setting some weights to 0, so in that way it can also discover good network architectures. The "lottery ticket hypothesis" essentially says that is what SGD does: it creates a bunch of subnetworks and then the best one wins. My intuitions about DNNs leave me skeptical about the LTH, but it's plausible.

    — It’s a bit hard to compare the “representational” power of a ReLU-based DNN with their method, since they allowed a variety of activation functions, some of considerable complexity such as sin(x). I’d be careful using that term, as it is a technical term referring to the size of the function space a network can represent.

    — They get a fairly big boost in performance by ensembling *the same* network with multiple different weight values. It’s not clear at all to me why this works and it’s possible this is just a fluke way of introducing noise that reduces overfitting somehow.

    — We need more work comparing evolutionary/genetic algorithms with DNNs. There are some papers showing evolutionary algorithms can discover useful molecules better than DNN-based reinforcement learning approaches. That is a very specific example, but I’d love to see more head-to-head comparisons. I expect more interesting work on evolutionary algorithms to come out soon, partly because cheap compute is making them more feasible.

    — It is true DNNs can be compressed to some extent (one of the most popular papers showing this is https://arxiv.org/abs/1503.02531). I’m not entirely convinced this method gives smaller networks, but it seems plausible. Sadly, they don’t seem to have mentioned the number of parameters. They are also using a richer set of activation functions, which makes comparison difficult (I wish they had just used ReLU for comparison’s sake). You mentioned you trained a two-layer network on MNIST that got 98%. How many parameters did it have?

    1. My two-layer network had the following shapes:
      X: (64, 784)
      w1: (784, 128)
      w2: (128, 10)
      So if I’m doing my math right, it had 101,632 parameters (100,352 for the first layer and 1,280 for the second). Is that correct? I’m still new to Deep Learning, so if my math is wrong, correct me so I can learn.
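      A quick check in Python (counting the weight matrices only; bias vectors, if I had used them, would add 128 + 10 more parameters):

      layer_shapes = [(784, 128), (128, 10)]                     # (w1, w2)
      per_layer = [rows * cols for rows, cols in layer_shapes]   # [100352, 1280]
      print(sum(per_layer))                                      # 101632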

      And it scored 97% on the test set, not 98%.

      The lottery ticket hypothesis makes good sense to me. I note the Popperian/evolutionary nature of the hypothesis, which I take as a good sign that it’s worth exploring further. This article looked interesting: https://arxiv.org/abs/2002.00585

      Dan, are you familiar with “Evolution is exponentially more powerful with frequency-dependent selection” by Artem Kaznatcheev?

      You might find it interesting given your comments about ANNs vs. genetic algorithms. He produces a genetic algorithm that can solve problems previously considered intractable for GAs by incorporating a concept of ‘ecology’ into the model.

      “They get a fairly big boost in performance by ensembling *the same* network with multiple different weight values. It’s not clear at all to me why this works and it’s possible this is just a fluke way of introducing noise that reduces overfitting somehow.”

      I did not know that. That’s interesting.
